Friday, March 6, 2026

Kedro Guide: Your production-ready data science toolkit


# Introduction

Data analytics projects usually start with exploratory Python notebooks, but at some stage they need to be moved to production settings, which can be hard if not carefully planned.

Kedro, an open-source framework developed by QuantumBlack, bridges the gap between experimental notebooks and production-ready solutions, translating design principles like structure, scalability, and reproducibility into practice.

This article introduces and discusses the main features of Kedro, walking you through its core concepts for a better understanding before diving into the framework to implement real-world data science projects.

# First steps with Kedro

The first step to using Kedro is, of course, installing it in our working environment, preferably an IDE – Kedro cannot be fully utilized in notebook environments. Open your favorite Python IDE, such as VS Code, and type in the integrated terminal:

pip install kedro

Then we create a new Kedro project using this command:

kedro new

If the command works, you will be asked a few questions, including the name of your project. Let’s call it Churn Predictor. If the command doesn’t work, it may be due to a conflict caused by multiple Python versions being installed. In this case, the cleanest solution is to work in a virtual environment within your IDE. Here are some quick workaround commands to create one (ignore them if the previous command to create a Kedro project already worked!):

python3.11 -m venv venv

source venv/bin/activate

pip install kedro

kedro --version

Then select the following Python interpreter in your IDE to work on from now on: ./venv/bin/python.

At this point, if everything went well, you should see the full project structure inside churn-predictor on the left side (in the “EXPLORER” panel in VS Code). In the terminal, let’s move into the main folder of our project:

cd churn-predictor

It’s time to take a look at Kedro’s core features with our newly created project.

# Discovering the basic elements of Kedro

The first element that we will introduce – and populate ourselves – is the data catalog. In Kedro, this element is responsible for isolating data definitions from the main code.

An empty file that acts as the data catalog has already been created within the project structure. We just need to find it and fill it with content. In the IDE explorer, inside the churn-predictor project, go to conf/base/catalog.yml, open this file, and add the following:

raw_customers:
  type: pandas.CSVDataset
  filepath: data/01_raw/customers.csv

processed_features:
  type: pandas.ParquetDataset
  filepath: data/02_intermediate/features.parquet

train_data:
  type: pandas.ParquetDataset
  filepath: data/02_intermediate/train.parquet

test_data:
  type: pandas.ParquetDataset
  filepath: data/02_intermediate/test.parquet

trained_model:
  type: pickle.PickleDataset
  filepath: data/06_models/churn_model.pkl

Above, we have just defined (not yet created) five datasets, each with a key or name: raw_customers, processed_features, and so on. The main data pipeline that we will create later will refer to these datasets by name, completely abstracting and isolating I/O from the code.
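
To make this I/O abstraction concrete, here is a minimal, hypothetical sketch in plain Python (this is not Kedro's implementation or API, just the idea it embodies): pipeline code asks for a dataset by name, and the catalog alone decides its format and location:

```python
import csv
import json
import os
import tempfile

# A toy catalog: dataset name -> (format, filepath).
# Changing an entry changes where/how data is stored,
# without touching any pipeline code.
CATALOG = {}

def register(name, fmt, path):
    CATALOG[name] = (fmt, path)

def save(name, rows):
    fmt, path = CATALOG[name]
    with open(path, "w", newline="") as f:
        if fmt == "csv":
            writer = csv.DictWriter(f, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(rows)
        elif fmt == "json":
            json.dump(rows, f)
        else:
            raise ValueError(f"unknown format: {fmt}")

def load(name):
    fmt, path = CATALOG[name]
    with open(path, newline="") as f:
        if fmt == "csv":
            return list(csv.DictReader(f))
        elif fmt == "json":
            return json.load(f)
        raise ValueError(f"unknown format: {fmt}")

# Usage: the calling code only ever sees the name "raw_customers".
tmp = tempfile.mkdtemp()
register("raw_customers", "csv", os.path.join(tmp, "customers.csv"))
save("raw_customers", [{"customer_id": "1", "churned": "0"}])
rows = load("raw_customers")
```

Swapping a dataset's format or path then means editing one catalog entry, not the pipeline code – which is exactly what catalog.yml gives you in Kedro.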

We now need some data to act as the first dataset in the catalog definitions above. For this example, you can take this sample of synthetically generated customer data, download it, and integrate it into your Kedro project.

Then we go to data/01_raw, create a new file named customers.csv, and add the contents of the sample dataset we will work with. The easiest way is to view the “raw” contents of the dataset file on GitHub, select everything, and copy and paste it into the newly created file in your Kedro project.

Now we will create a Kedro pipeline, which describes the data analysis workflow applied to our raw dataset. In the terminal, enter:

kedro pipeline create data_processing

This command creates several Python files inside src/churn_predictor/pipelines/data_processing/. Now we will open nodes.py and paste the following code:

import pandas as pd
from typing import Tuple

def engineer_features(raw_df: pd.DataFrame) -> pd.DataFrame:
    """Create derived features for modeling."""
    df = raw_df.copy()
    df['tenure_months'] = df['account_age_days'] / 30
    df['avg_monthly_spend'] = df['total_spend'] / df['tenure_months']
    df['calls_per_month'] = df['support_calls'] / df['tenure_months']
    return df

def split_data(df: pd.DataFrame, test_fraction: float) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Split data into train and test sets."""
    train = df.sample(frac=1-test_fraction, random_state=42)
    test = df.drop(train.index)
    return train, test

The two functions we just defined work as nodes that apply transformations to a dataset in a repeatable, modular workflow. The first performs basic, illustrative feature engineering, deriving several new features from the raw ones. The second splits the dataset into training and test sets, e.g. for downstream machine learning modeling.
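
As a quick sanity check, the two node functions can be run standalone on a tiny, made-up DataFrame (all column values below are invented for illustration):

```python
import pandas as pd

# Node functions as defined in nodes.py above.
def engineer_features(raw_df: pd.DataFrame) -> pd.DataFrame:
    df = raw_df.copy()
    df['tenure_months'] = df['account_age_days'] / 30
    df['avg_monthly_spend'] = df['total_spend'] / df['tenure_months']
    df['calls_per_month'] = df['support_calls'] / df['tenure_months']
    return df

def split_data(df: pd.DataFrame, test_fraction: float):
    train = df.sample(frac=1 - test_fraction, random_state=42)
    test = df.drop(train.index)
    return train, test

# A tiny, invented sample of customer records.
raw = pd.DataFrame({
    'account_age_days': [90, 365, 730, 30, 180],
    'total_spend': [120.0, 900.0, 2400.0, 25.0, 300.0],
    'support_calls': [1, 4, 2, 0, 3],
})

features = engineer_features(raw)
train, test = split_data(features, test_fraction=0.2)
print(sorted(set(features.columns) - set(raw.columns)))
# -> ['avg_monthly_spend', 'calls_per_month', 'tenure_months']
```

Because the nodes are plain Python functions, they are easy to unit-test outside of Kedro before wiring them into a pipeline.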

There is another Python file in the same subdirectory: pipeline.py. Let’s open it and add:

from kedro.pipeline import Pipeline, node
from .nodes import engineer_features, split_data

def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline([
        node(
            func=engineer_features,
            inputs="raw_customers",
            outputs="processed_features",
            name="feature_engineering"
        ),
        node(
            func=split_data,
            inputs=["processed_features", "params:test_fraction"],
            outputs=["train_data", "test_data"],
            name="split_dataset"
        )
    ])

Some of the magic happens here: notice the names used for the inputs and outputs of the nodes in the pipeline. Like Lego blocks, we can flexibly refer to the different dataset definitions in our data catalog, starting, of course, with the raw customer dataset we created earlier.
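
To illustrate that wiring, here is a toy, hypothetical runner (not Kedro's internals) showing how a pipeline can resolve node inputs and outputs purely by dataset name, with the catalog-backed data held in a plain dict:

```python
# Each "node" is (func, input_names, output_names); datasets is a
# name -> value mapping standing in for the catalog.
def run_pipeline(nodes, datasets):
    data = dict(datasets)
    for func, inputs, outputs in nodes:
        results = func(*[data[name] for name in inputs])
        if len(outputs) == 1:          # normalize single outputs
            results = (results,)
        for name, value in zip(outputs, results):
            data[name] = value
    return data

# Invented stand-in node functions mirroring the pipeline's shape.
double = lambda xs: [2 * x for x in xs]
split = lambda xs, frac: (xs[: int(len(xs) * (1 - frac))],
                          xs[int(len(xs) * (1 - frac)):])

nodes = [
    (double, ["raw_customers"], ["processed_features"]),
    (split, ["processed_features", "params:test_fraction"],
     ["train_data", "test_data"]),
]
out = run_pipeline(nodes, {"raw_customers": [1, 2, 3, 4, 5],
                           "params:test_fraction": 0.2})
```

The node functions never open a file or know where "processed_features" lives; the runner (in Kedro, the framework plus the catalog) resolves every name on their behalf.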

There are a few final setup steps left to get everything working. The proportion of test data for the splitting node is defined as a parameter to be passed in. In Kedro, we define these “external” parameters in the conf/base/parameters.yml file. Let’s add a value to this currently empty configuration file (the exact fraction is up to you; 0.2 is a common choice):

test_fraction: 0.2

Additionally, by default the Kedro project imports hooks for the PySpark library, which we won’t actually need. In settings.py (in the “src” subdirectory) we can disable this by commenting out and modifying the first few existing lines of code as follows:

# Instantiated project hooks.
# from churn_predictor.hooks import SparkHooks  # noqa: E402

# Hooks are executed in a Last-In-First-Out (LIFO) order.
HOOKS = ()

Save all changes, make sure you have pandas installed in your working environment, and run the project from the IDE terminal:

kedro run

This may or may not work on the first try, depending on the version of Kedro you have installed. If it fails with a DatasetError, the likely fix is pip install kedro-datasets or pip install pyarrow (or maybe both!), then try running again.

Hopefully you will see several “INFO” messages informing you about the different stages of the data flow. This is a good sign. In the data/02_intermediate directory, you will find several Parquet files containing the data processing results.

To finish, you can optionally pip install kedro-viz and run kedro viz to open an interactive graph of your brilliant workflow in your browser, as shown below:

Kedro-viz: an interactive workflow visualization tool

# Summary

We will leave further exploration of this tool to a possible future article. If you’ve made it this far, you’ve built your first Kedro project and learned its core components and features, understanding how they interact along the way.

Well done!

Ivan Palomares Carrascosa is a thought leader, writer, speaker, and advisor in the fields of Artificial Intelligence, Machine Learning, Deep Learning, and LLMs. He trains and advises others on the use of artificial intelligence in the real world.
