Monday, March 9, 2026

Command-line statistics for beginner data scientists


# Introduction

If you’re just starting out in data science, you may think you need Python, R, or other specialized software to perform statistical analysis of your data. However, the command line already provides a powerful set of statistical tools.

Command-line tools can often process immense datasets faster than applications that must load everything into memory. They are easy to script and automate. Moreover, these tools run on any Unix system without installing anything.

In this article, you will learn how to perform basic statistical operations directly from your terminal, using only built-in Unix tools.

🔗 The accompanying Bash script is on GitHub. Coding along is highly recommended to fully understand the concepts.

To follow this tutorial you will need:

  • A Unix-like environment (Linux, macOS, or Windows with WSL).
  • Only standard Unix tools, all of which come preinstalled.

To get started, open a terminal.
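If you want to confirm the tools are available before continuing, here is a minimal sketch; every tool listed is standard, so none should be reported missing:

```shell
# Report any of the standard tools used in this article that are not on PATH
for tool in awk cut sort uniq wc head tail; do
  command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
done
```

If this prints nothing, you are ready to go.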

# Setting up sample data

Before we can analyze data, we need a dataset. Create a simple CSV file representing your website’s daily traffic by running the following command in your terminal:

cat > traffic.csv << EOF
date,visitors,page_views,bounce_rate
2024-01-01,1250,4500,45.2
2024-01-02,1180,4200,47.1
2024-01-03,1520,5800,42.3
2024-01-04,1430,5200,43.8
2024-01-05,980,3400,51.2
2024-01-06,1100,3900,48.5
2024-01-07,1680,6100,40.1
2024-01-08,1550,5600,41.9
2024-01-09,1420,5100,44.2
2024-01-10,1290,4700,46.3
EOF

This creates a new file called traffic.csv with a header row and ten lines of sample data.

# Exploring the data

// Counting rows in a data set

One of the first things to check in a dataset is how many records it contains. The wc (word count) command with the -l flag counts the number of lines in a file:

wc -l traffic.csv

The output displays: 11 traffic.csv (11 lines total minus 1 header = 10 lines of data).
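If you'd rather count only the data rows directly, you can drop the header before counting. A small self-contained sketch, with two inline rows standing in for traffic.csv:

```shell
# tail -n +2 drops the header line, so wc -l counts only data rows
printf '%s\n' 'date,visitors' '2024-01-01,1250' '2024-01-02,1180' |
tail -n +2 | wc -l
```

On the ten-row traffic.csv, the same pipeline (tail -n +2 traffic.csv | wc -l) prints 10.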

// Viewing your data

Before starting the calculations, it is worth checking the data structure. The head command displays the first few lines of a file:

head -n 5 traffic.csv

This displays the first 5 lines, allowing you to preview the data.

date,visitors,page_views,bounce_rate
2024-01-01,1250,4500,45.2
2024-01-02,1180,4200,47.1
2024-01-03,1520,5800,42.3
2024-01-04,1430,5200,43.8

// Single column extraction

To work with specific columns in a CSV file, use the cut command with a delimiter and a field number. The following command extracts the visitors column:

cut -d',' -f2 traffic.csv | tail -n +2

This extracts field 2 (the visitors column) with cut, while tail -n +2 skips the header line.
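As an aside, awk can do the extraction and the header skip in one step. A sketch on a couple of inline rows standing in for traffic.csv:

```shell
# awk splits fields on the comma (-F',') and NR>1 skips the header row
printf '%s\n' 'date,visitors,page_views,bounce_rate' \
              '2024-01-01,1250,4500,45.2' \
              '2024-01-02,1180,4200,47.1' |
awk -F',' 'NR>1 {print $2}'
```

Which form you prefer is a matter of taste; cut pipelines read well, while awk keeps everything in one process.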

# Calculating measures of central tendency

// Finding the mean (average)

The mean is the sum of all values divided by the number of values. We can calculate it by extracting the target column and then using awk to accumulate the values:

cut -d',' -f2 traffic.csv | tail -n +2 | awk '{sum+=$1; count++} END {print "Mean:", sum/count}'

The awk command accumulates the sum and the count as it processes each line, then divides them in the END block.

Then we calculate the median and mode.

// Finding the median

The median is the middle value of a sorted dataset. With an even number of values, it is the average of the two middle values. First sort the data, then find the middle:

cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | awk '{arr[NR]=$1; count=NR} END {if(count%2==1) print "Median:", arr[(count+1)/2]; else print "Median:", (arr[count/2]+arr[count/2+1])/2}'

This sorts the data numerically with sort -n, stores the values in an array, and then finds the middle value (or the average of the two middle values if the count is even).

// Finding the mode

The mode is the most common value. We find it by sorting, counting duplicates, and identifying which value appears most often:

cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | uniq -c | sort -rn | head -n 1 | awk '{print "Mode:", $2, "(appears", $1, "times)"}'

This sorts the values, counts duplicates with uniq -c, sorts by frequency in descending order, and selects the top result.
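Every value in our visitors column happens to be unique, so the pipeline above simply returns one value with a count of 1. On data with genuine repeats, the most frequent value wins. A small sketch with inline numbers:

```shell
# 20 appears twice in this set, so it is the mode
printf '%s\n' 10 20 20 30 | sort -n | uniq -c | sort -rn | head -n 1 |
awk '{print "Mode:", $2, "(appears", $1, "times)"}'
# → Mode: 20 (appears 2 times)
```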

# Calculating measures of dispersion (spread)

// Finding the maximum value

To find the largest value in a dataset, we compare each value and track the maximum:

awk -F',' 'NR>1 {if($2>max) max=$2} END {print "Maximum:", max}' traffic.csv

This skips the header with NR>1, compares each value to the current maximum, and updates it whenever a larger value is found.

// Finding the minimum value

Similarly, to find the smallest value, initialize the minimum from the first data row and update it whenever a smaller value is found:

awk -F',' 'NR==2 {min=$2} NR>2 {if($2<min) min=$2} END {print "Minimum:", min}' traffic.csv

Run the above commands to get the maximum and minimum values.

// Finding the minimum and maximum values

Instead of running two separate commands, we can find both the minimum and maximum in one run:

awk -F',' 'NR==2 {min=$2; max=$2} NR>2 {if($2<min) min=$2; if($2>max) max=$2} END {print "Min:", min, "Max:", max}' traffic.csv

This single-pass approach initializes both variables from the first row and then updates each of them independently.

// Calculation of (population) standard deviation

Standard deviation measures how much the values differ from the mean. For the full population, use the following formula:

awk -F',' 'NR>1 {sum+=$2; sumsq+=$2*$2; count++} END {mean=sum/count; print "Std Dev:", sqrt((sumsq/count)-(mean*mean))}' traffic.csv

This accumulates the sum and the sum of squares, then applies the formula \( \sqrt{\frac{\sum x^2}{N} - \mu^2} \), giving the result:

Std Dev: 207.364

// Calculating the sample standard deviation

When working with a sample rather than the full population, use Bessel's correction (divide by \( n-1 \)) for an unbiased sample estimate:

awk -F',' 'NR>1 {sum+=$2; sumsq+=$2*$2; count++} END {mean=sum/count; print "Sample Std Dev:", sqrt((sumsq-(sum*sum/count))/(count-1))}' traffic.csv

This gives a slightly larger value than the population version:

Sample Std Dev: 218.581

// Calculating variance

The variance is the square of the standard deviation. This is another measure of spread useful in many statistical calculations:

awk -F',' 'NR>1 {sum+=$2; sumsq+=$2*$2; count++} END {mean=sum/count; var=(sumsq/count)-(mean*mean); print "Variance:", var}' traffic.csv

The calculation mirrors the standard deviation but skips the square root.
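For completeness, the sample variance divides by \( n-1 \) instead of \( n \), matching the sample standard deviation above. A sketch on three inline values in place of the full column:

```shell
# Sample variance with Bessel's correction: divide by (n-1), not n
printf '%s\n' 1250 1180 1520 |
awk '{sum+=$1; sumsq+=$1*$1; n++}
     END {print "Sample variance:", (sumsq - sum*sum/n)/(n-1)}'
```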

# Calculating percentiles

// Calculating quartiles

Quartiles divide the sorted data into four equal parts. They are particularly useful for understanding data distribution:

cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | awk '
{arr[NR]=$1; count=NR}
END {
  q1_pos = (count+1)/4
  q2_pos = (count+1)/2
  q3_pos = 3*(count+1)/4
  print "Q1 (25th percentile):", arr[int(q1_pos)]
  print "Q2 (Median):", (count%2==1) ? arr[int(q2_pos)] : (arr[count/2]+arr[count/2+1])/2
  print "Q3 (75th percentile):", arr[int(q3_pos)]
}'

This script stores the sorted values in an array, calculates the quartile positions using the formula \( (n+1)/4 \), and extracts the values at those positions. The code outputs:

Q1 (25th percentile): 1100
Q2 (Median): 1355
Q3 (75th percentile): 1520
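A common follow-up is the interquartile range (IQR = Q3 - Q1), a robust measure of spread often used for outlier detection. A sketch applying the same position formula to the sorted visitor counts:

```shell
# IQR on the sorted visitors column: Q3 minus Q1
printf '%s\n' 980 1100 1180 1250 1290 1420 1430 1520 1550 1680 |
awk '{arr[NR]=$1; n=NR}
     END {
       q1 = arr[int((n+1)/4)]      # position 2.75 -> index 2
       q3 = arr[int(3*(n+1)/4)]    # position 8.25 -> index 8
       print "IQR:", q3 - q1
     }'
# → IQR: 420
```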

// Calculating any percentile

You can calculate any percentile by adjusting the rank calculation. The following versatile approach uses linear interpolation:

PERCENTILE=90
cut -d',' -f2 traffic.csv | tail -n +2 | sort -n | awk -v p=$PERCENTILE '
{arr[NR]=$1; count=NR}
END {
  pos = (count+1) * p/100
  idx = int(pos)
  frac = pos - idx
  if(idx >= count) print p "th percentile:", arr[count]
  else print p "th percentile:", arr[idx] + frac * (arr[idx+1] - arr[idx])
}'

This calculates the position as \( (n+1) \times p/100 \) and then uses linear interpolation between array indices for fractional positions.

# Working with multiple columns

Often you will need to calculate statistics in multiple columns at once. Here's how to calculate averages for visitors, page views, and bounce rate at the same time:

awk -F',' '
NR>1 {
  v_sum += $2
  pv_sum += $3
  br_sum += $4
  count++
}
END {
  print "Average visitors:", v_sum/count
  print "Average page views:", pv_sum/count
  print "Average bounce rate:", br_sum/count
}' traffic.csv

This keeps a separate accumulator for each column and divides each by the same row count, giving the following result:

Average visitors: 1340
Average page views: 4850
Average bounce rate: 45.06
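The same single-pass pattern extends to other statistics. For example, here is a sketch tracking min and max for two columns at once, on three inline rows standing in for traffic.csv:

```shell
# One pass over the data, with separate min/max trackers per column
printf '%s\n' 'date,visitors,page_views' \
              '2024-01-01,1250,4500' \
              '2024-01-02,1180,4200' \
              '2024-01-03,1520,5800' |
awk -F',' '
NR==2 {vmin=vmax=$2; pmin=pmax=$3}
NR>2  {if ($2<vmin) vmin=$2; if ($2>vmax) vmax=$2
       if ($3<pmin) pmin=$3; if ($3>pmax) pmax=$3}
END   {print "Visitors:", vmin, "to", vmax
       print "Page views:", pmin, "to", pmax}'
```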

// Calculating correlations

Correlation measures the relationship between two variables. The Pearson correlation coefficient ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation):

awk -F', *' '
NR>1 {
  x[NR-1] = $2
  y[NR-1] = $3

  sum_x += $2
  sum_y += $3

  count++
}
END {
  if (count < 2) exit

  mean_x = sum_x / count
  mean_y = sum_y / count

  for (i = 1; i <= count; i++) {
    dx = x[i] - mean_x
    dy = y[i] - mean_y

    cov   += dx * dy
    var_x += dx * dx
    var_y += dy * dy
  }

  sd_x = sqrt(var_x / count)
  sd_y = sqrt(var_y / count)

  correlation = (cov / count) / (sd_x * sd_y)

  print "Correlation:", correlation
}' traffic.csv

This calculates the Pearson correlation by dividing the covariance by the product of the standard deviations.
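Swapping in the bounce-rate column ($4 instead of $3) shows a negative relationship: in this dataset, busier days tend to have lower bounce rates. A compact sketch of the same calculation on four inline rows:

```shell
# Pearson correlation between visitors ($2) and bounce rate ($4);
# cov/sqrt(vx*vy) is algebraically the same as (cov/n)/(sd_x*sd_y)
printf '%s\n' 'date,visitors,page_views,bounce_rate' \
              '2024-01-01,1250,4500,45.2' \
              '2024-01-03,1520,5800,42.3' \
              '2024-01-05,980,3400,51.2' \
              '2024-01-07,1680,6100,40.1' |
awk -F',' '
NR>1 {x[++n]=$2; y[n]=$4; sx+=$2; sy+=$4}
END {
  mx=sx/n; my=sy/n
  for (i=1; i<=n; i++) {
    dx=x[i]-mx; dy=y[i]-my
    cov+=dx*dy; vx+=dx*dx; vy+=dy*dy
  }
  print "Correlation:", cov/sqrt(vx*vy)
}'
```

The printed coefficient is negative and close to -1 on these rows.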

# Conclusion

The command line is a powerful environment for statistical analysis. You can process large volumes of data, calculate sophisticated statistics, and automate reports, all without installing anything beyond what your system already provides.

These skills complement, not replace, your knowledge of Python and R. Use command-line tools to quickly explore and validate your data, then move on to specialized tools for complex modeling and visualization as needed.

The best part is that these tools are available on virtually every system you will use in your data science career. Open a terminal and start exploring your data.

Bala Priya C is a software developer and technical writer from India. She likes working at the intersection of mathematics, programming, data analytics, and content creation. Her areas of interest and specialization include DevOps, data analytics, and natural language processing. She enjoys reading, writing, coding, and coffee! She is currently working on learning and sharing her knowledge with the developer community by writing tutorials, guides, reviews, and more. Bala also creates engaging resource overviews and coding tutorials.
