
Image by author
Julia is another programming language, like Python and R. It combines the speed of low-level languages like C with the simplicity of Python. Julia is becoming popular in the data science space, so if you’re looking to expand your portfolio and learn a new language, you’ve come to the right place.
In this tutorial, we will learn how to set up Julia for data science, load data, perform data analysis, and then visualize it. The tutorial is so simple that anyone, even a student, can start using Julia for data science within 5 minutes.
1. Setting up the environment
- Download the Julia installer from julialang.org and install it.
- Now we need to configure Julia for Jupyter Notebook. Launch a terminal (PowerShell), type `julia` to start the Julia REPL, and then run the following commands.
using Pkg
Pkg.add("IJulia")
- Launch Jupyter Notebook and create a new notebook with Julia as the kernel.
- Create a new code cell and run the following commands to install the necessary data analysis packages.
using Pkg
Pkg.add("DataFrames")
Pkg.add("CSV")
Pkg.add("Plots")
Pkg.add("Chain")
2. Loading data
In this example, we use the Online Sales Dataset from Kaggle, which contains data on online sales transactions across various product categories.
We will load the CSV file into a DataFrame, which is similar to a Pandas DataFrame.
using CSV
using DataFrames
# Load the CSV file into a DataFrame
data = CSV.read("Online Sales Data.csv", DataFrame)
3. Data exploration
Instead of the `head` function, we will use the `first` function to display the top 5 rows of the DataFrame.
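The call might look like this (assuming the DataFrame is stored in `data`, as above):

```julia
# Show the first 5 rows of the DataFrame
first(data, 5)
```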


To generate a data summary, we will use the `describe` function.
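A minimal sketch:

```julia
# Column-wise summary statistics: mean, min, median, max, and missing counts
describe(data)
```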


Similar to a Pandas DataFrame, we can view a specific value by specifying the row number and column name.
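For example, using the "Product Category" column from this dataset (the row number is arbitrary):

```julia
# Value in row 3 of the "Product Category" column
data[3, "Product Category"]
```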
4. Data manipulation
We will use the `filter` function to select rows based on a condition. It takes an anonymous function that tests each row (here, whether "Unit Price" is greater than 230) and the DataFrame to filter.
filtered_data = filter(row -> row[:"Unit Price"] > 230, data)
last(filtered_data, 5)


We can also create a new column, similar to Pandas. It’s that simple.
data[!, :"Total Revenue After Tax"] = data[!, :"Total Revenue"] .* 0.9
last(data, 5)


Now we will calculate the average value of “Total Revenue After Tax” for each “Product Category”.
using Statistics
grouped_data = groupby(data, :"Product Category")
aggregated_data = combine(grouped_data, :"Total Revenue After Tax" .=> mean)
last(aggregated_data, 5)


5. Visualization
Plotting works much like Seaborn. In our case, we visualize a bar chart of the aggregated data we just created. We provide the X and Y columns, then a title and axis labels.
using Plots
# Basic plot
bar(aggregated_data[!, :"Product Category"], aggregated_data[!, :"Total Revenue After Tax_mean"], title="Product Analysis", xlabel="Product Category", ylabel="Total Revenue After Tax Mean")
Most of the average total revenue is generated by electronics. The visualization is clean and clear.


To generate histograms, we only need to provide column X data and labels. We want to visualize the frequency of items being sold.
histogram(data[!, :"Units Sold"], title="Units Sold Analysis", xlabel="Units Sold", ylabel="Frequency")


It looks like most people bought one or two items.
To save a visualization, we will use the `savefig` function.
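For example, to save the histogram above to a PNG file (the filename here is my own choice):

```julia
# Save the most recently displayed plot to disk
savefig("units_sold_histogram.png")
```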
6. Creating a data processing pipeline
Building the right data pipeline is necessary to automate data processing workflows, ensure data consistency, and enable scalable and efficient data analysis.
We will use the `Chain` package to chain together the functions we used earlier to calculate the average total revenue by product category.
using Chain
using Statistics

# Example of a simple data processing pipeline
processed_data = @chain data begin
    filter(row -> row[:"Unit Price"] > 230, _)
    groupby(_, :"Product Category")
    combine(_, :"Total Revenue" => mean)
end
first(processed_data, 5)


To save the processed DataFrame as a CSV file, we will use the `CSV.write` function.
CSV.write("output.csv", processed_data)
Conclusion
In my opinion, Julia is simpler and faster than Python. Many of the tools and features I am used to, such as Pandas, Seaborn, and Scikit-Learn, have close equivalents in Julia. So why not learn a new language and start doing things better than your colleagues? It may also help you land a research job, as Julia is popular among scientific researchers.
In this tutorial, we learned how to set up a Julia environment, load a dataset, perform data analysis and visualization, and build a data pipeline for repeatability and reliability. If you want to learn more about Julia for data science, let me know so I can write more simple tutorials for you.
Abid Ali Awan (@1abidaliawan) is a certified data science professional who loves building machine learning models. He currently focuses on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master’s degree in Technology Management and a Bachelor’s degree in Telecommunication Engineering. His vision is to build an AI product using Graph Neural Networks for students struggling with mental illness.
