Configuring a Machine Learning Pipeline on Google Cloud Platform

Photo by editor via ChatGPT

# Introduction

Machine learning has become an integral part of many companies, and those that do not use it risk falling behind. Given how critical models are in providing a competitive advantage, it is natural that many companies want to integrate them into their systems.

There are many ways to set up a machine learning pipeline to support a business, and one option is to host it with a cloud provider. Developing and deploying machine learning models in the cloud has many advantages, including scalability, cost-efficiency, and simplified processes compared to building the entire pipeline in-house.

The choice of cloud provider depends on the business, but in this article we will explore how to set up a machine learning pipeline on Google Cloud Platform (GCP).

Let’s start.

# Preparation

Before continuing, you need a Google account, because we will be using GCP. After creating an account, open the Google Cloud Console.

Once you are in the console, create a new project.

Then, before anything else, you need to set up billing. GCP requires you to register payment information before you can do most things on the platform, even with a free trial account. You do not need to worry, though, because the example we will use will not consume much of your free credit.

Enter all the billing information required to start the project. You may also need your tax information and a credit card, so make sure they are ready.

With everything in place, let’s start building our machine learning pipeline with GCP.

# Machine Learning Pipeline with Google Cloud Platform

To build our machine learning pipeline, we need an example dataset. We will use the Heart Attack Prediction dataset from Kaggle for this tutorial. Download the data and store it somewhere for now.

Next, we need to set up storage for the dataset, which the machine learning pipeline will read from. To do this, we must create a storage bucket. Search for “Cloud Storage” to create a bucket. Its name must be globally unique. For now, you do not need to change any of the default settings; just click Create.
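Because the bucket name must be unique across all of Cloud Storage, it can help to sanity-check a candidate name before trying it in the console. The sketch below is a simplified, local check covering only a subset of the documented naming rules (length, allowed characters, and the reserved "goog" prefix and "google" substring); the helper name `looks_like_valid_bucket_name` is hypothetical, and the check cannot tell whether a name is already taken.

```python
import re

# Simplified pattern: 3-63 characters, lowercase letters, digits, dots,
# dashes, and underscores, starting and ending with a letter or digit.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name):
    """Return True if `name` passes a simplified subset of the naming rules."""
    if not _BUCKET_NAME.fullmatch(name):
        return False
    # Names may not start with "goog" or contain "google".
    if name.startswith("goog") or "google" in name:
        return False
    return True


print(looks_like_valid_bucket_name("heart-attack-data-demo"))  # placeholder name
```

A name that passes this check can still be rejected by the console if another user already owns it.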

After creating the bucket, upload the CSV file to it. If you did it correctly, you will see the dataset inside the bucket.
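The upload can also be scripted instead of done through the console. The following is a minimal sketch, assuming the `google-cloud-storage` client library is installed and authenticated; the helper name `upload_csv`, the bucket name, and the file names are hypothetical placeholders rather than part of the walkthrough.

```python
def upload_csv(client, bucket_name, local_path, blob_name):
    """Upload a local CSV file to an existing bucket and return its gs:// URI.

    `client` is expected to be a google.cloud.storage.Client instance.
    """
    bucket = client.bucket(bucket_name)    # reference to the bucket
    blob = bucket.blob(blob_name)          # object name inside the bucket
    blob.upload_from_filename(local_path)  # performs the actual upload
    return f"gs://{bucket_name}/{blob_name}"


# Example usage (placeholder names; requires valid GCP credentials):
# from google.cloud import storage
# uri = upload_csv(storage.Client(), "your-unique-bucket-name",
#                  "heart_attack.csv", "heart_attack.csv")
```

Passing the client in as an argument keeps the helper easy to reuse in a notebook where a client may already exist.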

Next, we will create a new table that we can query using the BigQuery service. Search for “BigQuery” and click “Add data”. Choose “Google Cloud Storage” and select the CSV file from the bucket we created earlier.

Fill in the information, especially the destination project, the dataset (create a new dataset or select an existing one), and the table name. For the schema, select “Auto-detect” and then create the table.

If the table was created successfully, you can query it to check that you can access the dataset.
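A small preview query is usually enough for this check. As a sketch, the helper below (a hypothetical name, `build_preview_query`) assembles the fully qualified table ID in the `project.dataset.table` form and returns SQL that can be pasted into the BigQuery query editor; the identifiers shown are placeholders for whatever you chose when creating the table.

```python
def build_preview_query(project_id, dataset, table, limit=10):
    """Return a SELECT statement that previews the first rows of a table."""
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    # BigQuery expects backticks around a fully qualified table ID.
    table_id = f"{project_id}.{dataset}.{table}"
    return f"SELECT *\nFROM `{table_id}`\nLIMIT {int(limit)}"


# Placeholder identifiers; replace them with your own.
print(build_preview_query("your-project-id", "your_dataset", "heart_attack"))
```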

Next, search for Vertex AI and enable all the recommended APIs. Once that is done, select “Colab Enterprise”.

Choose “Create a notebook” to create the notebook that we will use for our simple machine learning pipeline.

If you know Google Colab, the interface will look very similar. If you want, you can import a notebook from an external source.

With the notebook ready, connect to a runtime. For now, the default machine type is enough, because we do not need many resources.

Let’s start developing the machine learning pipeline by querying data from our BigQuery table. First, we need to initialize the BigQuery client with the following code.

from google.cloud import bigquery

client = bigquery.Client()

Then, query the dataset in the BigQuery table using the following code. Change the project ID, dataset, and table name to match what you created earlier.

# TODO: Replace with your project ID, dataset, and table name
query = """
SELECT *
FROM `your-project-id.your_dataset.heart_attack`
LIMIT 1000
"""
query_job = client.query(query)

df = query_job.to_dataframe()

The data is now in a pandas DataFrame in our notebook. Let’s transform our target variable (“Outcome”) into a numerical label.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df['Outcome'] = df['Outcome'].apply(lambda x: 1 if x == 'Heart Attack' else 0)

Then prepare our training and testing sets.

df = df.select_dtypes('number')

X = df.drop('Outcome', axis=1)
y = df['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

⚠️ Note: df = df.select_dtypes('number') is used to simplify the example by dropping all non-numeric columns. In a real-world scenario, this is an aggressive step that could discard useful categorical features. It is done here for simplicity; normally, feature engineering or encoding would be considered.
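To illustrate what encoding means here, the sketch below one-hot encodes a categorical column using only the standard library. The column names (`Age`, `Gender`) and values are made up for illustration and are not taken from the dataset; in the notebook itself, `pandas.get_dummies` or scikit-learn's `OneHotEncoder` would be the more usual tools.

```python
def one_hot_encode(rows, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    # Sort the categories so the output columns have a stable order.
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the categorical one.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Made-up example rows; real column names depend on the dataset.
rows = [
    {"Age": 54, "Gender": "Male"},
    {"Age": 61, "Gender": "Female"},
]
print(one_hot_encode(rows, "Gender"))
```

Encoding a column this way keeps its information available to the model instead of discarding it.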

When the data is ready, let’s train the model and assess its performance.

model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(f"Model Accuracy: {accuracy_score(y_test, y_pred)}")

The model’s accuracy is only about 0.5. This could certainly be improved, but for this example we will continue with this simple model.

Now, let’s use our model to make predictions and prepare the results.
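One way to put an accuracy of about 0.5 in context is to compare it with a majority-class baseline, that is, the accuracy obtained by always predicting the most frequent label. The sketch below uses only the standard library and made-up toy labels; the function names are hypothetical, and in the notebook you could pass `y_test.tolist()` and `y_pred.tolist()` instead.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one label is required")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most frequent label."""
    if len(y_true) == 0:
        raise ValueError("at least one label is required")
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Toy labels for illustration only, not results from the dataset.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0]
print("model:", accuracy(y_true, y_pred))
print("baseline:", majority_baseline_accuracy(y_true))
```

If the model does not beat this baseline, it has learned little beyond the class distribution, which is a useful signal before investing in scheduling the pipeline.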

result_df = X_test.copy()
result_df['actual'] = y_test.values
result_df['predicted'] = y_pred
result_df.reset_index(inplace=True)

Finally, we will save the model’s predictions to a new BigQuery table. Note that the following code overwrites the destination table if it already exists, rather than appending to it.

# TODO: Replace with your project ID and destination dataset/table
destination_table = "your-project-id.your_dataset.heart_attack_predictions"
job_config = bigquery.LoadJobConfig(write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE)
load_job = client.load_table_from_dataframe(result_df, destination_table, job_config=job_config)
load_job.result()
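If you would rather keep earlier predictions and add new rows on each run, the write disposition can be switched from truncate to append. The helper below is a small sketch with a hypothetical name, `pick_write_disposition`; it returns the disposition names as plain strings, which match the values of `bigquery.WriteDisposition`.

```python
def pick_write_disposition(append):
    """Return the BigQuery write disposition name for a load job.

    "WRITE_APPEND" adds rows to an existing table, while
    "WRITE_TRUNCATE" replaces the table's contents.
    """
    return "WRITE_APPEND" if append else "WRITE_TRUNCATE"


# Example usage in the notebook (assumes `bigquery`, `client`, `result_df`,
# and `destination_table` from the code above):
# job_config = bigquery.LoadJobConfig(
#     write_disposition=pick_write_disposition(append=True)
# )
# client.load_table_from_dataframe(
#     result_df, destination_table, job_config=job_config
# ).result()
```

Appending is usually the better fit for a scheduled notebook when you want a history of predictions, although you may then want to add a run timestamp column to tell the batches apart.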

With that, you have created a simple machine learning pipeline in a Vertex AI notebook.

To streamline this process, you can schedule the notebook to run automatically. Go to your notebook’s actions and select “Schedule”.

Select the frequency at which the notebook should run, for example every Tuesday or on the first day of the month. This is a simple way to make sure the machine learning pipeline runs as required.

That covers setting up a simple machine learning pipeline on GCP. There are many other, more production-ready ways to set up a pipeline, such as using Kubeflow Pipelines (KFP) or the more integrated Vertex AI Pipelines service.

# Conclusion

Google Cloud Platform gives users an easy way to set up a machine learning pipeline. In this article, we learned how to set up a pipeline using various cloud services such as Cloud Storage, BigQuery, and Vertex AI. By building the pipeline as a notebook and scheduling it to run automatically, we can create a simple, functional pipeline.

I hope it helped!

Cornellius Yudha Wijaya is a data assistant and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips through social media and writing. Cornellius writes on a variety of AI and machine learning topics.
