Build Your Own Simple Data Pipeline with Python and Docker


Image by the author | Ideogram

Data is the resource that drives our work as data professionals. Without the right data, we cannot do our jobs, and our business cannot gain a competitive advantage. Securing relevant data is therefore crucial for any data professional, and data pipelines are the systems designed for this purpose.

Data pipelines are systems for moving and transforming data from a source to a destination. They are part of the core infrastructure of any data-driven company because they help keep data reliable and ready to use.

Building a data pipeline may sound complicated, but a few simple tools are enough to create a reliable one with just a few lines of code. In this article, we will look at how to build a simple data pipeline using Python and Docker that you can apply in your daily data work.

Let’s get into it.

Building a data pipeline

Before building our data pipeline, let’s understand the concept of ETL, which stands for extract, transform, and load. ETL is a process in which the data pipeline performs the following actions:

  • Extract data from various sources.
  • Transform the data into a valid format.
  • Load the data into an accessible storage location.

ETL is a standard model for data pipelines, so what we build will be in line with this structure.
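
To make the three stages concrete before any files or containers are involved, here is a minimal sketch using only the Python standard library. The in-memory CSV text, the column names, and the function names are made up for illustration; they are not the dataset or the code used later in this article.

```python
import csv
import io

# Hypothetical in-memory CSV standing in for a real source file.
RAW_CSV = "Name , Age\nAna,34\nBen,\nCy,51\n"

def extract(text):
    """Extract: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalize header names and drop rows with empty values."""
    cleaned = []
    for row in rows:
        normalized = {key.strip().lower(): value.strip() for key, value in row.items()}
        if all(normalized.values()):
            cleaned.append(normalized)
    return cleaned

def load(rows, sink):
    """Load: write the cleaned rows to a file-like destination."""
    writer = csv.DictWriter(sink, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)

sink = io.StringIO()
load(transform(extract(RAW_CSV)), sink)
result = sink.getvalue()
print(result)
```

The row for "Ben" has no age, so it is dropped during the transform stage, and the header names lose their stray spaces. The sketch assumes at least one row survives the transform step; a real pipeline would want to handle the empty case.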

With Python and Docker, we can build a data pipeline around the ETL process with a simple setup. Python is a valuable tool for orchestrating any data flow activity, while Docker is useful for managing the pipeline application’s environment using containers.

Let’s configure our data pipeline with Python and Docker.

Step 1: Preparation

First of all, we must have Python and Docker installed on our system (we will not cover the installation here).

In our example, we will use the heart attack dataset from Kaggle as the data source to develop our ETL process.

With everything in place, we will prepare the project structure. In general, a simple data pipeline will have the following skeleton:

simple-data-pipeline/
├── app/
│   └── pipeline.py
├── data/
│   └── Medicaldataset.csv
├── Dockerfile
├── requirements.txt
└── docker-compose.yml

There is a main folder called simple-data-pipeline, which contains:

  • An app folder containing the pipeline.py file.
  • A data folder containing the source data (Medicaldataset.csv).
  • The requirements.txt file for environment dependencies.
  • The Dockerfile for the Docker configuration.
  • The docker-compose.yml file to define and run our multi-container Docker application.
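
If you prefer to create this skeleton from a script rather than by hand, the following is one possible sketch using pathlib. The create_skeleton helper is a hypothetical name introduced here, and the example runs in a temporary directory so it does not write into your current folder. The dataset itself is not created; Medicaldataset.csv still has to be downloaded and placed in the data folder.

```python
from pathlib import Path
import tempfile

def create_skeleton(root):
    """Create the folders and empty placeholder files for the project."""
    root = Path(root)
    (root / "app").mkdir(parents=True, exist_ok=True)
    (root / "data").mkdir(parents=True, exist_ok=True)
    for relative in ("app/pipeline.py", "Dockerfile", "requirements.txt", "docker-compose.yml"):
        (root / relative).touch(exist_ok=True)
    return root

# Demonstrated in a temporary directory so nothing is written to the current folder.
with tempfile.TemporaryDirectory() as tmp:
    project = create_skeleton(Path(tmp) / "simple-data-pipeline")
    created = sorted(p.relative_to(project).as_posix() for p in project.rglob("*"))
    print(created)
```

To create the real project, call create_skeleton("simple-data-pipeline") from the directory where you want it to live.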

First, we will fill out the requirements.txt file, which lists the libraries required for our project.

In this case, we will only use the following library:

pandas

In the next section, we will set up the data pipeline using our sample dataset.

Step 2: Configure the pipeline

We will write the ETL process in the Python pipeline.py file. In our case, we will use the following code.

import pandas as pd
import os

input_path = os.path.join("/data", "Medicaldataset.csv")
output_path = os.path.join("/data", "CleanedMedicalData.csv")

def extract_data(path):
    df = pd.read_csv(path)
    print("Data Extraction completed.")
    return df

def transform_data(df):
    df_cleaned = df.dropna()
    df_cleaned.columns = [col.strip().lower().replace(" ", "_") for col in df_cleaned.columns]
    print("Data Transformation completed.")
    return df_cleaned

def load_data(df, output_path):
    df.to_csv(output_path, index=False)
    print("Data Loading completed.")

def run_pipeline():
    df_raw = extract_data(input_path)
    df_cleaned = transform_data(df_raw)
    load_data(df_cleaned, output_path)
    print("Data pipeline completed successfully.")

if __name__ == "__main__":
    run_pipeline()

The pipeline follows the ETL process: we load the CSV file, perform data transformations such as dropping rows with missing values and cleaning the column names, and write the cleaned data to a new CSV file. We wrapped these steps in a single run_pipeline function that executes the whole process.
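
To see what the column-cleaning expression in transform_data does, here is a small sketch that applies the same string operations to a few header names. The normalize_column_name helper and the sample names are hypothetical and exist only to exercise the rule; they are not taken from the dataset.

```python
def normalize_column_name(name):
    """Mirror the expression used in transform_data for a single column name."""
    return name.strip().lower().replace(" ", "_")

# Hypothetical header names, chosen only to exercise the rule.
sample_columns = [" Heart Rate", "Blood Sugar ", "Age", "Systolic blood pressure"]
normalized = [normalize_column_name(col) for col in sample_columns]
print(normalized)
# ['heart_rate', 'blood_sugar', 'age', 'systolic_blood_pressure']
```

Because strip() runs before replace(), leading and trailing spaces are removed rather than turned into underscores. Consecutive spaces inside a name would each become an underscore, so a stricter rule may be worth adding if your headers are messier.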

Step 3: Configure the Dockerfile

With the Python pipeline file ready, we will fill out the Dockerfile to set up the Docker container using the following code:

FROM python:3.10-slim

WORKDIR /app
COPY ./app /app
COPY requirements.txt .

RUN pip install --no-cache-dir -r requirements.txt

CMD ["python", "pipeline.py"]

In the code above, we specify that the container will use Python 3.10 (the slim image) as its environment. Next, we set the container’s working directory to /app and copy everything from our local app folder into the container’s /app directory. We also copy the requirements.txt file and run the pip installation inside the container. Finally, we specify the command that runs the Python script when the container starts.

With the Dockerfile ready, we will prepare the docker-compose.yml file to manage the overall execution:

version: '3.9'

services:
  data-pipeline:
    build: .
    container_name: simple_pipeline_container
    volumes:
      - ./data:/data

When executed, the YAML file above will build the Docker image from the current directory using the Dockerfile. We also mount the local data folder to the /data folder in the container, making the dataset accessible to our script.

Running the pipeline

With all the files ready, we will run the data pipeline in Docker. Go to the project root folder and run the following command in your terminal to build the Docker image and execute the pipeline.
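
The script hardcodes /data because that is where the volume is mounted inside the container. If you also want to run pipeline.py directly on your machine, one option is to read the folder from an environment variable and fall back to /data. This is only a sketch of that idea: the DATA_DIR variable name and the resolve_paths helper are assumptions introduced here and are not part of the original script.

```python
import os

def resolve_paths(env=None):
    """Return (input_path, output_path), defaulting to the container's /data mount."""
    env = os.environ if env is None else env
    data_dir = env.get("DATA_DIR", "/data")  # DATA_DIR is a hypothetical variable name
    return (
        os.path.join(data_dir, "Medicaldataset.csv"),
        os.path.join(data_dir, "CleanedMedicalData.csv"),
    )

# Without the variable, the paths point at the mounted volume inside the container.
container_paths = resolve_paths({})
# With the variable set, the same script can read from a local folder.
local_paths = resolve_paths({"DATA_DIR": "./data"})
print(container_paths)
print(local_paths)
```

Inside the container nothing changes, since the variable is unset and the default matches the mount target in docker-compose.yml.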

docker compose up --build

If it runs successfully, you will see log output like the following:

 ✔ data-pipeline                           Built                                                                                   0.0s 
 ✔ Network simple_docker_pipeline_default  Created                                                                                 0.4s 
 ✔ Container simple_pipeline_container     Created                                                                                 0.4s 
Attaching to simple_pipeline_container
simple_pipeline_container  | Data Extraction completed.
simple_pipeline_container  | Data Transformation completed.
simple_pipeline_container  | Data Loading completed.
simple_pipeline_container  | Data pipeline completed successfully.
simple_pipeline_container exited with code 0

If everything completes successfully, you’ll see a new CleanedMedicalData.csv file in the data folder.

Congratulations! You just created a simple data pipeline with Python and Docker. Try using different data sources and ETL steps to see if you can handle a more complex pipeline.

Conclusion

Understanding data pipelines is crucial for every data professional, as they are essential for acquiring the right data for their work. In this article, we looked at how to build a simple data pipeline with Python and Docker and learned how to run it.
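
Beyond checking that the file exists, you may want to confirm that the output has the properties the transform step is meant to produce: normalized column names and no empty values. The following is a rough sketch using only the standard library. The check_cleaned_csv helper is a hypothetical name, and the demonstration runs on a tiny temporary file with made-up values rather than on the real output.

```python
import csv
import os
import tempfile

def check_cleaned_csv(path):
    """Return a list of problems found in a cleaned CSV file (empty if none)."""
    problems = []
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        for name in header:
            if name != name.strip().lower().replace(" ", "_"):
                problems.append(f"column not normalized: {name!r}")
        # Data rows start on line 2, right after the header.
        for line_number, row in enumerate(reader, start=2):
            if any(value == "" for value in row):
                problems.append(f"empty value on line {line_number}")
    return problems

# Demonstration on a small temporary file with made-up values;
# point the function at data/CleanedMedicalData.csv to check the real output.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.csv")
    with open(sample, "w", newline="") as handle:
        handle.write("age,heart_rate\n64,66\n21,\n")
    problems = check_cleaned_csv(sample)
    print(problems)
```

The sample file deliberately contains one incomplete row, so the check reports it. On the real cleaned file, an empty list would suggest the transform step behaved as intended.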

I hope it helped!

Cornellius Yudha Wijaya is a data assistant and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.
