Overview
In part one of a four-part series, we are going to set up and walk through the Kedro framework. Kedro is an open-source Python framework that provides you with scaffolding to ensure your data projects adhere to software-engineering best practices. In turn, this makes your project reproducible and robust to requirement changes.
Introduction
In this post we are going to talk about and walk through using the Kedro framework. To keep things simple, we are going to use a clean dataset provided by the National Oceanic and Atmospheric Administration (NOAA) that measures the monthly temperature changes across twelve Brazilian cities. You can download the dataset from Kaggle. We are going to keep things as simple as possible and only go over the key components of a Kedro project, which include:
- Creating a Kedro project
- Defining a data catalog
- Creating a Kedro pipeline
- Integrating Kedro plugins
If you want to learn how to use Kedro’s more advanced features, check out the official documentation.
Getting Started
Let’s start off by creating a clean environment. We are going to use Conda in this example, but you can use whatever virtual environment manager you prefer. After installing Anaconda or Miniconda, you can create a new environment by executing the following command in your terminal:
# Creates a new virtual environment
conda create -c conda-forge -n my_new_environment python=3.12 -y
The -c conda-forge argument tells conda to give the conda-forge channel higher priority than the default channel. We also installed Python 3.12 in this new environment. Now that we have a clean environment, we need to install Kedro into it. Activate your new environment and install Kedro with:
# Installs kedro
pip install -U kedro==0.19.11
You can verify your installation by executing:
# Displays installed kedro version and extensions
kedro info
Creating a New Kedro Project
Now that we have Kedro installed, we need to create a new Kedro project. You can use the Kedro command-line interface (CLI):
# Creates a new Kedro project
kedro new
You will be prompted to pick a name for your project. Let’s call our project Climate-Brazil. Next, you will be prompted to select the Kedro tools you want to use. For this project, let’s go with 1, 3, 5, 7, which correspond to Linting, Logging, Data Folder, and Kedro-Viz. Once you have created your project, you’ll notice that Kedro created a project directory, using the project name we selected, with subdirectories and files that serve as the scaffolding of your project.
Note that there are some files in the image below that you won’t have at this moment. These will be added/generated as you continue through this walkthrough.
Kedro Project Structure
Let’s briefly walk through the five subdirectories that Kedro created for us; a rough sketch of the resulting layout follows the list.
- conf: This directory contains your project configurations. These can be anything from API/database credentials to model parameters. This is also where you can create multiple project environments like Dev and Prod.
- data: This directory is where you will store your data and your project artifacts.
- notebooks: This directory is where you can interactively develop your project before creating modular nodes and pipelines.
- src: This directory is where you will put your source code, defined as nodes and chained together into pipelines.
- tests: This directory is where you will write your unit tests.
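For reference, here is a rough sketch of the layout at this point; the exact contents may differ slightly depending on your Kedro version and the tools you selected:

Climate-Brazil/
├── conf/
│   ├── base/          # shared configuration (catalog.yml, parameters.yml, ...)
│   └── local/         # machine-specific config and credentials (git-ignored)
├── data/
│   ├── 01_raw/
│   ├── 02_intermediate/
│   └── ...            # additional layered data folders
├── notebooks/
├── src/
│   └── climate_brazil/
│       └── pipelines/
├── tests/
└── requirements.txt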
I know that was really brief, but don’t worry, we will learn the specifics as we develop the project.
Declare Project Dependencies
Now that we have a project created, let’s add any packages that our project will depend on. In the root of our project directory Kedro created a requirements.txt file for us. Kedro adds some dependencies to the file by default, but we will need a few additional packages for our project: polars, plotly, and pre-commit. Let’s modify requirements.txt to contain the following:
# ./requirements.txt
ipython>=8.10
jupyterlab>=3.0
kedro-datasets>=3.0
kedro-viz>=6.7.0
kedro[jupyter]~=0.19.11
notebook
pre-commit~=3.8.0
polars~=1.22.0
pyarrow~=19.0.1
plotly~=5.24.1
openpyxl~=3.1.5
Now that we have declared our dependencies, let’s go ahead and install them into our environment.
pip install -r requirements.txt
Defining a Data Catalog
When using Kedro, all input/output (IO) operations are defined in a data catalog. This allows you to declaratively define your data sources, whether they are stored as local files, in a SQL/NoSQL database, or in some form of remote storage like S3.
In our simple example, all the data are stored locally, so we don’t need to define any credentials to access them. In practice, however, it is likely that you will need to access data from a database or cloud storage. In that case, you would define your credentials in /conf/local/credentials.yml.
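For illustration only (not needed for this project), credentials for something like an S3 bucket could be defined under a key of your choosing and then referenced from a catalog entry via its credentials field. The key names and paths below are made up for the example:

# /conf/local/credentials.yml  (illustrative sketch only)
my_s3_credentials:
  key: YOUR_ACCESS_KEY_ID
  secret: YOUR_SECRET_ACCESS_KEY

# /conf/base/catalog.yml  (hypothetical remote dataset)
some_remote_dataset:
  type: polars.LazyPolarsDataset
  file_format: csv
  filepath: s3://my-bucket/some_file.csv
  credentials: my_s3_credentials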
Before we edit our catalog, go ahead and download the data to /data/01_raw/. You can get the data files from Kaggle.
We declare our base catalog by modifying /conf/base/catalog.yml. The base catalog will be inherited by all other Kedro environments you create. For example, if you create a dev environment you don’t need to repeat the catalog entries that are in /conf/base/catalog.yml inside of /conf/dev/catalog.yml.
If you declare a catalog entry in /conf/dev/catalog.yml that shares the same key as an entry in /conf/base/catalog.yml, it will override the base entry when you are working in the dev environment.
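For example, assuming you wanted a dev environment to read a smaller sample file (the entry name and file path below are hypothetical), you could add just that one override and then select the environment at run time:

# /conf/dev/catalog.yml  (hypothetical override)
example_dataset:
  type: polars.LazyPolarsDataset
  file_format: csv
  filepath: data/01_raw/example_dev_sample.csv

# Select the dev environment when running
kedro run --env dev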
There are several ways that we can add our data to the catalog. The simplest way is to define one entry for each Brazilian city:
# /conf/base/catalog.yml
belem_dataset:
  type: polars.LazyPolarsDataset
  file_format: csv
  filepath: data/01_raw/station_belem.csv

curitiba_dataset:
  type: polars.LazyPolarsDataset
  file_format: csv
  filepath: data/01_raw/station_curitiba.csv

# The rest of the cities follow suit
If you have a lot of cities, this can get really tedious. Luckily, Kedro has a feature called Dataset Factories that helps you reduce repeated code in your data catalog. We can declare all of our cities with one block like this:
# /conf/base/catalog.yml
# Dataset Factory
"{city}_dataset":
  type: polars.LazyPolarsDataset
  file_format: csv
  filepath: data/01_raw/station_{city}.csv
Kedro is smart enough that, at runtime, it will use pattern matching to resolve {city} to the corresponding city's dataset. Finally, the approach we will actually use is to declare the data as a PartitionedDataset. You can think of each Brazilian city as one partition of the full dataset composed of all the Brazilian cities. Kedro will return a dictionary with each file’s name as the key and a load method as the corresponding value.
Note that when using a PartitionedDataset the data is loaded lazily, meaning the data is not actually read until you call the load method. This helps prevent out-of-memory (OOM) errors if your data can’t fit into memory.
We can declare our data as a PartitionedDataset like this:
# /conf/base/catalog.yml
cities_temperatures:
  type: partitions.PartitionedDataset
  path: data/01_raw/
  dataset:
    type: polars.LazyPolarsDataset
    file_format: csv
Now we can load our dataset by simply starting up a Kedro session (Kedro does this for you) and calling catalog.load("cities_temperatures"). You can even work with the data interactively in a Jupyter notebook: Kedro automatically loads the session and context when you run kedro jupyter lab on the command line, which means that once the Jupyter kernel is running you already have the catalog object available in your environment.
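Here is a quick sketch of what that looks like interactively; the partition keys depend on your file names, so treat the ones below as illustrative:

# Inside a notebook started with kedro jupyter lab, the catalog object is already available
partitions = catalog.load("cities_temperatures")  # dict of {partition id: load function}
print(list(partitions.keys()))                    # e.g. ['station_belem.csv', 'station_curitiba.csv', ...]

# Nothing has been read from disk yet; calling a value triggers the actual (lazy) load
belem_lf = partitions["station_belem.csv"]()
print(belem_lf.collect().head())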
Creating a New Kedro Pipeline
Typically, once you have defined your data catalog you’d want to work interactively on cleaning and wrangling your data into a usable format. As mentioned earlier, you can do that using kedro jupyter lab. Once you have your processing/modeling/reporting code in shape, you should write modular functions that, when chained together, constitute a pipeline. That could be a data processing pipeline, a machine learning pipeline, a monitoring pipeline, and so on. This is where Kedro pipelines come into play. We can create a new pipeline using the Kedro CLI by executing:
kedro pipeline create data_processing
Kedro will create the structure for your new pipeline in /src/climate_brazil/pipelines/data_processing. We will define all of our modular code inside /src/climate_brazil/pipelines/data_processing/nodes.py and then declare our pipeline inside /src/climate_brazil/pipelines/data_processing/pipeline.py. Let’s start with the nodes.
Each Brazilian city’s dataset is in a wide format, with years across the rows and the months of that year across the columns. We need to merge all of the cities together and unpivot the data into a long format where both year and month vary across the rows. The following function does what we just described:
# /src/climate_brazil/pipelines/data_processing/nodes.py
import re

import plotly.graph_objects as go
import polars as pl


def process_datasets(partitioned_dataset: dict) -> pl.DataFrame:
    """
    Combine all cities into one dataframe and unpivot the data into long format
    ---
    params:
        partitioned_dataset: Our data partitioned into key (filename) / value (load method) pairs
    """
    # Missing values encoding
    mv_code = 999.9

    # Add a city column to each partition so that when we merge them all together we can identify each city
    datasets = [
        v().with_columns(city=pl.lit(re.findall(r"(?<=_).*(?=\.)", k)[0]))
        for k, v in partitioned_dataset.items()
    ]

    df_merged = pl.concat(datasets)

    df_processed = (
        df_merged.drop("D-J-F", "M-A-M", "J-J-A", "S-O-N", "metANN")
        .rename({"YEAR": "year"})
        # Need to collect because we can't unpivot a lazyframe
        .collect()
        .unpivot(
            on=[
                "JAN",
                "FEB",
                "MAR",
                "APR",
                "MAY",
                "JUN",
                "JUL",
                "AUG",
                "SEP",
                "OCT",
                "NOV",
                "DEC",
            ],
            index=["city", "year"],
            variable_name="month",
            value_name="temperature",
        )
        .with_columns(
            pl.col("month")
            .str.to_titlecase()
            .str.strptime(dtype=pl.Date, format="%b")
            .dt.month()
        )
        .with_columns(
            date=pl.date(year=pl.col("year"), month=pl.col("month"), day=1),
        )
        .with_columns(
            pl.when(
                pl.col("temperature") == mv_code  # This is how missing data is coded in the dataset
            )
            .then(None)
            .otherwise(pl.col("temperature"))
            .name.keep(),
            pl.col("city").str.to_titlecase(),
        )
        .drop("year", "month")
    )
    return df_processed
Let’s also define a function that will plot our time series data.
def timeseries_plot(processed_dataframe: pl.DataFrame) -> go.Figure:
    """
    Plots each Brazilian city temperature time series
    """
    fig = go.Figure()
    for city in processed_dataframe.select("city").unique(maintain_order=True).to_series():
        fig.add_trace(
            go.Scatter(
                x=processed_dataframe.filter(pl.col("city") == city).sort("date")["date"],
                y=processed_dataframe.filter(pl.col("city") == city).sort("date")["temperature"],
                name=city,
                hovertemplate="<b>Date</b>: %{x}<br><b>Temperature</b>: %{y}",
            )
        )
    fig.update_layout(
        title="Temperature Measurements of Brazilian Cities",
        xaxis=dict(
            title="Date",
            rangeselector=dict(
                buttons=list([
                    dict(count=1, label="1y", step="year", stepmode="backward"),
                    dict(count=5, label="5y", step="year", stepmode="backward"),
                    dict(count=10, label="10y", step="year", stepmode="backward"),
                    dict(step="all", label="All"),
                ]),
            ),
            rangeslider=dict(visible=True),
            type="date",
            rangeselector_font_color="black",
            rangeselector_activecolor="hotpink",
            rangeselector_bgcolor="lightblue",
        ),
        yaxis=dict(
            title="Temperature in Celsius",
        ),
    )
    return fig
This will produce the following figure:
Now that we have defined our nodes, let’s see how we can chain them together into a simple pipeline. We define our data processing pipeline in the pipeline.py file that Kedro creates for us in the data_processing directory.
# /src/climate_brazil/pipelines/data_processing/pipeline.py
from kedro.pipeline import node, Pipeline, pipeline  # noqa
from .nodes import process_datasets, timeseries_plot


def create_pipeline(**kwargs) -> Pipeline:
    return pipeline([
        node(
            func=process_datasets,
            inputs=dict(partitioned_dataset="cities_temperatures"),
            outputs="cities_temperatures_processed",
            name="process_datasets",
        ),
        node(
            func=timeseries_plot,
            inputs=dict(processed_dataframe="cities_temperatures_processed"),
            outputs="cities_ts_plot",
            name="timeseries_plot",
        ),
    ])
Notice that the input to the process_datasets function is the name that we chose for our dataset when we defined it in our catalog. Also note that we are choosing to name the output of process_datasets cities_temperatures_processed, and we are passing that as the input to the timeseries_plot function.
Kedro automatically infers the dependencies between your nodes and may run them in a different order than the one in which you defined them.
Before we move on to the next section, it is important to know that both of our outputs will be MemoryDatasets, which means they only exist while the Kedro session is active. In order to persist the outputs, we need to add them to our catalog.
# /conf/base/catalog.yml
# Raw data ---------------------------------------
cities_temperatures:
  type: partitions.PartitionedDataset
  path: data/01_raw/
  dataset:
    type: polars.LazyPolarsDataset
    file_format: csv

# Processed data ---------------------------------------
cities_temperatures_processed:
  type: polars.EagerPolarsDataset
  file_format: parquet
  filepath: data/02_intermediate/cities_temperatures_processed.pq

cities_ts_plot:
  type: plotly.JSONDataset
  filepath: data/08_reporting/cities_ts_plot.json
We are going to store our processed data locally as a Parquet file and our figure as a JSON file. Great, we are now ready to run our pipeline!
Running a Kedro Pipeline
To run the data processing pipeline we just defined you can execute:
kedro run
That will run your default pipeline, which Kedro assembles from all of your registered pipelines; since we only have one pipeline, this runs the data processing pipeline. Okay, so how do we run a specific pipeline? For that you can use:
kedro run --pipeline data_processing
Kedro also allows you to run specific nodes. For example, if we just wanted to process the data without generating a plot we could run:
kedro run --nodes process_datasets
Where process_datasets is the name we gave the node when we defined the pipeline.
Kedro is very flexible and supports a variety of run options. You can check out the Kedro run documentation for more information.
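For instance, a few commonly used options let you run only part of a pipeline or filter nodes by tag (check kedro run --help for the exact flags available in your version):

# Run everything needed up to and including a given node
kedro run --to-nodes process_datasets

# Resume from a given node onwards
kedro run --from-nodes process_datasets

# Run only nodes carrying a given tag (assuming you have tagged your nodes)
kedro run --tags my_tag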
Pipeline Graphs with Kedro-Viz
When we created our Kedro project earlier, we told Kedro to add the kedro-viz plugin to our dependencies. This plugin allows you to visualize your pipeline (or pipelines) as a graph. To view the simple pipeline we built earlier, execute the following at the command line:
kedro viz run
Your browser should automatically launch and you will be greeted by the following screen.
Our simple example doesn’t do Kedro-Viz justice. In practice, you’ll have many datasets coming from disparate sources that require more complex processing and joining. In such a scenario, being able to see the lineage and dependencies of your data becomes very useful. Kedro-Viz is also interactive: you can preview the data, view static or interactive figures, and view the code of each node, all in the user interface. I recommend that you try this plugin on your next project!
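If you need to tweak how the server runs, recent kedro-viz releases expose a few flags; the ones below are assumptions based on those releases, so verify them with kedro viz run --help:

# Serve on a different port and rebuild the graph automatically when files change
kedro viz run --port 4142 --autoreload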
Source Control & Pre-Commit
I am sure that everyone reading this already understands the importance of source/version control, so I will keep this brief. When we created our project, Kedro was nice enough to create a .gitignore file and .gitkeep files. The .gitignore file makes sure that you don’t accidentally commit any data, or any credentials that you store in conf/local/credentials.yml, to a remote repository. Kedro does not, unfortunately, set up any pre-commit configuration, so you need to do that manually. Here is an example of a pre-commit configuration that includes linting with ruff, checking your codebase for missing docstrings with interrogate, and stripping out any confidential data from notebooks with nbstripout.
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    # Ruff version.
    rev: v0.11.2
    hooks:
      # Run the linter.
      - id: ruff
        types_or: [ python, pyi ]
        args: [ --fix ]
      # Run the formatter.
      - id: ruff-format
        types_or: [ python, pyi ]
  - repo: https://github.com/econchick/interrogate
    rev: 1.7.0
    hooks:
      - id: interrogate
        args: [--config=pyproject.toml]
        pass_filenames: false
  - repo: https://github.com/kynan/nbstripout
    rev: 0.7.1
    hooks:
      - id: nbstripout
You also need to specify interrogate’s configuration within pyproject.toml. You can add the following at the bottom of that file:
[tool.interrogate]
ignore-init-method = false
ignore-init-module = false
ignore-magic = false
ignore-semiprivate = false
ignore-private = false
ignore-property-decorators = false
ignore-module = false
ignore-nested-functions = false
ignore-nested-classes = false
ignore-setters = false
ignore-overloaded-functions = false
fail-under = 80
exclude = ["tests", "docs", "build", "src/climate_brazil/__main__.py"]
ext = []
verbose = 2
quiet = false
whitelist-regex = []
color = true
omit-covered-files = false
After you’re done setting up your configuration, you need to initialize a local git repository and install your pre-commit hooks.
git init
pre-commit install
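It is also a good idea to run every hook against the whole codebase once, right after installing them, to catch anything that is already out of compliance:

# Runs all configured hooks against all files, not just staged ones
pre-commit run --all-files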
Wonderful, we are now all set with source control and have mitigated the possibility of committing anything confidential to the repository.
Running Pipelines in Containers
We can also run the pipeline we just built inside of a container. Kedro maintains the kedro-docker plugin, which facilitates getting your Kedro project running inside a container.
While the plugin is named kedro-docker, you can use it with other containerization frameworks such as Podman.
First, we need to install the plugin. You can add the following to your ./requirements.txt file:
kedro-docker~=0.6.2
Then execute:
pip install kedro-docker~=0.6.2
With the plugin installed, Kedro will generate a Dockerfile, a .dockerignore, and a .dive-ci file that correspond to your Kedro project when you execute:
kedro docker init
Make sure your Docker Engine is running, otherwise the previous step will fail.
Let’s take a look at the generated Dockerfile:
#./Dockerfile
ARG BASE_IMAGE=python:3.9-slim
FROM $BASE_IMAGE as runtime-environment
# update pip and install uv
RUN python -m pip install -U "pip>=21.2"
RUN pip install uv
# install project requirements
COPY requirements.txt /tmp/requirements.txt
RUN uv pip install --system --no-cache-dir -r /tmp/requirements.txt && rm -f /tmp/requirements.txt
# add kedro user
ARG KEDRO_UID=999
ARG KEDRO_GID=0
RUN groupadd -f -g ${KEDRO_GID} kedro_group && \
useradd -m -d /home/kedro_docker -s /bin/bash -g ${KEDRO_GID} -u ${KEDRO_UID} kedro_docker
WORKDIR /home/kedro_docker
USER kedro_docker
FROM runtime-environment
# copy the whole project except what is in .dockerignore
ARG KEDRO_UID=999
ARG KEDRO_GID=0
COPY --chown=${KEDRO_UID}:${KEDRO_GID} . .
EXPOSE 8888
CMD ["kedro", "run"]
Kedro will automatically assume that you want to run all your pipelines (or, rather, your default pipeline), but you can quite easily change the specified command in the generated Dockerfile.
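For example, if you only wanted the container to run the data processing pipeline we defined earlier, you could change the last line of the generated Dockerfile to something like:

# ./Dockerfile (modified last line)
CMD ["kedro", "run", "--pipeline", "data_processing"]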
Now that we have a Dockerfile and a .dockerignore, we can build our image.
kedro docker build
Finally, you can run your pipeline from within the container either by using your normal docker CLI commands or by using the kedro-docker plugin and executing:
kedro docker run
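If you prefer the plain Docker CLI instead, the rough equivalent is below; the image name is an assumption (kedro docker build tags the image after your project by default), and mounting the data folder ensures the outputs land on your host machine:

# Roughly equivalent plain Docker invocation (image name assumed)
docker run -v "$(pwd)/data:/home/kedro_docker/data" climate-brazil kedro run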
Summary
In this post we walked through using the Kedro framework and learned how to follow software-engineering best practices that keep your projects reproducible, modular, and easy to maintain as they mature. We learned about the data catalog, how to define nodes, and how to link nodes together to build pipelines. We then looked at a couple of Kedro plugins, kedro-viz and kedro-docker, that expanded the functionality of our project. We also talked about and walked through good practices to follow when implementing source control with git and pre-commit. All this in just part one of our series! We still have a long way to go, but I hope you are excited for what comes next.
Coming Next
If you use a popular, mainstream machine learning framework like PyTorch, TensorFlow, Scikit-Learn, or XGBoost, then reproducibility and scalability are fairly easy to achieve because most MLOps and distributed frameworks natively support these tools. What do you do if you have a custom solution? Let’s say you are using a probabilistic programming language like Stan or PyMC, for which there isn’t native support. Well, that is what we are going to tackle in the following parts of this series. In part two we will fit a time series model using PyMC and talk about how to use ModelBuilder from pymc-extras to build a production-ready model. I hope to see you there!