
# Introduction
While Jupyter notebooks, Pandas, and graphical dashboards dominate modern data science, they don’t always provide the level of control you may need. Command-line tools, on the other hand, may not be as intuitive as you would like, but they are powerful, lightweight, and much faster at the specific tasks they were designed for.
In this article I have tried to strike a balance between usability, maturity, and power. Here you’ll find classics that are almost impossible to avoid, as well as newer additions that fill gaps or improve performance. You could even call it the 2025 edition of the must-have CLI tools list. For those who aren’t familiar with CLI tools but want to learn, I’ve included an additional resources section in the summary, so scroll down before you start incorporating these tools into your workflow.
# 1. curl
curl is my favorite way to make HTTP requests such as GET, POST, or PUT, download files, and send or receive data over protocols such as HTTP and FTP. It is perfect for fetching data from APIs or downloading datasets, and it integrates easily with data ingestion pipelines to retrieve payloads in JSON, CSV, or other formats. The best thing about curl is that it comes pre-installed on most Unix systems, so you can start using it right away. However, its syntax (especially around headers, request bodies, and authentication) can be verbose and error-prone. When working with more complex APIs, you may prefer an easier-to-use Python wrapper or library, but knowing curl is still a significant advantage for quick testing and debugging.
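As a quick sketch – the endpoint and token below are placeholders, not a real API:

```shell
# Fetch data over HTTPS; -s silences the progress bar, -o names the output
# (the URL is a placeholder for a real API endpoint)
curl -s -H "Accept: application/json" \
  "https://api.example.com/v1/records" -o records.json

# Send a POST request with a JSON body and an auth header
# ($API_TOKEN is a placeholder environment variable)
curl -s -X POST "https://api.example.com/v1/records" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_TOKEN" \
  -d '{"name": "test-run"}'
```

The same flags work for local `file://` URLs, which is handy for quick offline tests.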
# 2. jq
jq is a lightweight JSON processor that allows you to query, filter, transform, and pretty-print JSON data. Since JSON is the dominant format for APIs, logs, and data exchange, jq is essential for extracting and transforming JSON in pipelines. It works like “Pandas for JSON in the shell”. Its biggest advantage is a concise language for handling complex JSON, but the syntax can take some time to learn, and very large JSON files may require extra attention to memory management.
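A small sketch of its query language, on made-up sample data:

```shell
# Pull matching records out of a JSON array and print a raw field
# (-r strips the quotes from the output string)
echo '{"users":[{"name":"ada","age":36},{"name":"bob","age":41}]}' \
  | jq -r '.users[] | select(.age > 40) | .name'
# prints: bob
```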
# 3. csvkit
csvkit is a suite of CSV-centric command-line tools for transforming, filtering, aggregating, combining, and exploring CSV files. You can select and reorder columns, subset rows, combine multiple files, convert between formats, and even run SQL-like queries against CSV data. csvkit understands CSV quoting semantics and headers, making it safer than general-purpose text processing tools for this format. Being written in Python, its performance may lag on very large datasets, and some complex queries may be easier in Pandas or SQL. If you need speed and efficient memory usage, consider the csvtk toolkit.
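A short sketch, assuming csvkit is installed (`pip install csvkit`); the file, column names, and values are made up:

```shell
# Build a tiny sample file, then slice and query it
printf 'city,population\nparis,2100000\nlyon,520000\n' > cities.csv

csvcut -c city cities.csv               # select a single column
csvgrep -c city -m paris cities.csv     # filter rows by value
csvsql --query "SELECT city FROM cities WHERE population > 1000000" cities.csv
```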
# 4. awk / sed
Link (sed): https://www.gnu.org/software/sed/manual/sed.html
Classic Unix tools such as awk and sed remain irreplaceable for text manipulation. awk excels at pattern scanning, field-based transformations, and quick aggregations, while sed specializes in text substitution, deletion, and transformation. Both are fast and lightweight, making them ideal for quick work in pipelines. However, their syntax can be unintuitive: as the logic grows, readability deteriorates and you may want to switch to a scripting language. Additionally, when dealing with nested or hierarchical data (e.g., nested JSON), these tools have limited expressiveness.
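For example, summing a column with awk and rewriting a label with sed, on inline sample data:

```shell
# awk: accumulate the second field and print the total at the end
printf 'a 1\nb 2\nc 3\n' | awk '{ total += $2 } END { print total }'
# prints: 6

# sed: substitute one string for another in a stream
printf 'status: FAIL\n' | sed 's/FAIL/PASS/'
# prints: status: PASS
```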
# 5. GNU parallel
GNU parallel speeds up your workflow by running multiple processes at once. Many data tasks can be “mapped” over pieces of data: say you need to perform the same transformation on hundreds of files – parallel can spread the work across CPU cores, speed up processing, and manage job control. However, there are I/O bottlenecks and system overhead to be aware of, and quoting and escaping can be tricky in complex pipelines. For cluster-scale or distributed workloads, consider resource-aware schedulers (e.g. Spark, Dask, Kubernetes).
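A minimal sketch; the file names are illustrative, and `xargs -P` is shown as a near-equivalent that is installed almost everywhere:

```shell
# Create a few sample files to work on
mkdir -p data
printf 'col\n1\n' > data/a.csv
printf 'col\n2\n' > data/b.csv

# With GNU parallel installed: one gzip job per file, spread across cores
#   ls data/*.csv | parallel gzip {}

# The xargs fallback: -n 1 passes one file per job, -P 4 caps concurrency
ls data/*.csv | xargs -n 1 -P 4 gzip

ls data/    # now contains a.csv.gz and b.csv.gz
```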
# 6. ripgrep (rg)
ripgrep (rg) is a fast recursive search tool designed for speed and efficiency. It respects .gitignore by default and skips hidden and binary files, making it much faster than traditional grep. It’s perfect for quickly searching code bases, log directories, or configuration files. Since it ignores certain paths by default, you may need to adjust flags to search everything, and it isn’t always installed by default on every platform.
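A short sketch, assuming ripgrep is installed; the file and pattern are made up:

```shell
# Create a sample file, then search the tree recursively
mkdir -p src
printf '# TODO: refactor this step\n' > src/pipeline.py

rg "TODO" --type py              # .gitignore-aware; skips hidden/binary files
rg --hidden --no-ignore "TODO"   # opt back in to searching everything
```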
# 7. datamash
datamash provides numeric, text, and statistical operations (sum, mean, median, grouping, etc.) directly in the shell via stdin or files. It’s lightweight and handy for quick aggregations without reaching for a heavier tool like Python or R, making it ideal for shell-based ETL or exploratory analysis. However, it is not designed for very large datasets or complex analyses, where specialized tools perform better. Additionally, grouping over very large inputs may require a significant amount of memory.
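A sketch on inline sample data, assuming GNU datamash is installed (`-W` treats runs of whitespace as the field delimiter):

```shell
# Group by the first column and aggregate the second
printf 'a 1\na 3\nb 10\n' | datamash -W groupby 1 sum 2
```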
# 8. htop
htop is an interactive system monitor and process viewer that provides real-time insight into per-process CPU, memory, and I/O utilization. When running heavy pipelines or training models, htop is extremely useful for tracking resource consumption and identifying bottlenecks. It is more user-friendly than the traditional top, but being interactive means it doesn’t fit well into automated scripts. It may also be missing from minimal server setups, and it is not a replacement for specialized performance tools (profilers, metrics dashboards).
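htop itself is interactive, so there is nothing to pipe; for scripted snapshots the classic non-interactive tools still apply. A sketch (the `--sort` flag assumes a Linux procps `ps`):

```shell
# Launch the interactive viewer in a terminal:
#   htop
# For a scripted snapshot, e.g. the five most memory-hungry processes:
ps aux --sort=-%mem | head -n 6    # header line plus five processes
```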
# 9. git
git is a distributed version control system that is essential for tracking changes to code, scripts, and small data assets. For reproducibility, collaboration, branching experiments, and rollbacks, git is the standard, and it integrates with deployment pipelines, CI/CD tools, and notebooks. Its disadvantage is that it is not intended for versioning large binary data, for which Git LFS, DVC, or specialized systems are better suited. The branching and merging workflow also takes some learning.
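A minimal sketch of the everyday workflow; the file and branch names are made up:

```shell
# Start a repository in a scratch directory and make a first commit
cd "$(mktemp -d)"
git init demo && cd demo
git config user.name "Ada"                # only needed if not set globally
git config user.email "ada@example.com"

echo 'print("hello")' > analysis.py
git add analysis.py
git commit -m "Add analysis script"

git checkout -b experiment/tuning         # isolate an experiment on a branch
git log --oneline
```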
# 10. tmux/screen
Terminal multiplexers such as tmux and screen allow you to run multiple terminal sessions in a single window, detach and reattach sessions, and resume work after an SSH disconnection. They are essential if you want to run long experiments or pipelines remotely. Although tmux is recommended for its active development and flexibility, its configuration and keybindings can be challenging for novices, and it may not be installed by default in minimal environments.
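The basic tmux session lifecycle looks like this (the session name is arbitrary; `attach` is commented out because it requires an interactive terminal):

```shell
# Start a named session in the background running a long job
tmux new-session -d -s training 'sleep 600'

tmux ls                          # list running sessions
# tmux attach -t training        # reattach; press Ctrl-b d to detach again
tmux kill-session -t training    # stop the session when done
```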
# Summary
If you’re just starting out, I would recommend mastering the “core four”: curl, jq, awk/sed, and git. They are used everywhere. Over time, you’ll discover domain-specific CLIs like SQL clients, the DuckDB command-line interface, or Datasette to incorporate into your workflow. To learn more, check out the following resources:
- Data Science at the Command Line by Jeroen Janssens
- The Art of Command Line on GitHub
- Mark Pearl’s Bash Cheat Sheet
- Communities like Unix & command line subreddits often reveal useful tricks and up-to-date tools that will expand your toolbox over time.
Kanwal Mehreen is a machine learning engineer and technical writer with a deep passion for data science and the intersection of artificial intelligence and medicine. She is co-author of the e-book “Maximizing Productivity with ChatGPT”. As a 2022 Google Generation Scholar for APAC, she promotes diversity and academic excellence. She is also recognized as a Teradata Diversity in Tech Scholar, a Mitacs Globalink Research Scholar, and a Harvard WeCode Scholar. Kanwal is a staunch supporter of change and founded FEMCodes to empower women in STEM fields.
