A guide to Orchest for building ML pipelines

Orchest is a tool for building machine learning pipelines. With it, we can assemble pipelines visually through the user interface Orchest provides, and one of its best features is that it requires no third-party integrations or DAGs.

Creating pipelines is one of the most important procedures in any application development task. We can define a pipeline as an arrangement of interconnected files that, when run in sequence, achieve the pipeline's final goal. A variety of platforms provide services for building pipelines. Orchest is one of these recently developed platforms; it understands users' requirements and offers a single interface for building pipelines. In this article, we are going to discuss the Orchest tool for building machine learning pipelines. The major points to be discussed in the article are listed below.

Table of contents 

  1. What is Orchest?
  2. Joining Orchest 
  3. Building pipeline
    1. Step 1: Data gathering
    2. Step 2: Data preprocessing
    3. Step 3: Defining model
    4. Step 4: Collecting accuracy  

Let’s begin with understanding what Orchest is.



What is Orchest?

Orchest is a tool that can be utilized for building machine learning pipelines in a visual manner, using the user interface Orchest provides, without any third-party integration or DAGs. The tool is very easy to use, and we can pair it with JupyterLab or VS Code to build our machine learning models in a pipeline setting. It supports several languages, including Python, R, and Julia.

A pipeline built in Orchest consists of the steps we use to build the model. Every step contains an executable file, and the Orchest UI interconnects the steps as nodes, creating a representation of how data flows from one step to the next. We can pick, drop, and connect steps very easily, which makes Orchest user-friendly. Visualizing the pipeline's progress helps keep the data flow accurate and makes it easier to debug the code if we find any errors.

To install this toolkit in our environment, we need Docker Engine version 20.10.7 or higher. On macOS and Linux, we can clone and install Orchest with the following commands:

git clone https://github.com/orchest/orchest.git && cd orchest
./orchest install

After cloning and installing, we can start it with:

./orchest start

Alongside the features above, Orchest offers various other integrations: for example, we can build a web app with Streamlit or write code that queries data from PostgreSQL. In this article, we aim to build a pipeline with Orchest so that we can understand how it works. Let's start by implementing a pipeline that classifies the iris dataset.


Joining Orchest 

Before getting started with an Orchest pipeline, we need to know how to sign in or create an account with Orchest, which we can do on this page. After creating an account, we have various options for building a data pipeline easily. For this article, we are using the free tier, which gives us an instance with a 50 GB volume, 2 vCPUs, and 8 GiB of memory to practice our projects on. Here you can get the whole pipeline. If nothing is changed, its overview will look like the following.

Let’s just start with the process.

First of all, to create a pipeline we click the create pipeline button in the pipeline tab after starting our instance.

After creating a pipeline and giving it a name, we get a blank page, as shown in the image below.

Here, the new step button creates our steps. Every step holds a file, which can be a Python, R, or Julia file. In my pipeline, I have used Python. Let's take a look at how I built a pipeline for iris classification.

Building pipeline 

Step 1: Data gathering

The image below shows the step that lets the pipeline load the data from sklearn.

This step contains an .ipynb file that can be opened in Jupyter Notebook. In a live session, clicking on the step opens a Jupyter file where we can write code. The following code completes this step.

import orchest
from sklearn.datasets import load_iris

# Load the iris features and labels.
df_data, df_target = load_iris(return_X_y=True)

# Pass the data to downstream steps under the name "data".
orchest.output((df_data, df_target), name="data")

In the above code, the first and last lines matter most. We import orchest, which comes pre-installed in the notebook environment, and orchest.output() exports the output of the current step to the next step. For the hand-off to work, the next step must be linked using the arrow connector in the pipeline interface.
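The orchest module only exists inside an Orchest session, but the contract it implements is simple: whatever a step passes to orchest.output() under a name becomes a key in the dictionary the next step reads via orchest.get_inputs(). A rough stand-alone sketch of that contract, using a hypothetical OrchestMock class (the real library transfers data between separate step processes, not through an in-memory dict):

```python
from sklearn.datasets import load_iris

# Hypothetical in-memory stand-in for Orchest's data passing,
# for illustration only.
class OrchestMock:
    def __init__(self):
        self._store = {}

    def output(self, data, name):
        # Upstream step: store the object under the given name.
        self._store[name] = data

    def get_inputs(self):
        # Downstream step: receive all named outputs of upstream steps.
        return dict(self._store)

orchest = OrchestMock()

# Step 1 (data gathering), as in the article:
df_data, df_target = load_iris(return_X_y=True)
orchest.output((df_data, df_target), name="data")

# Step 2 then unpacks the output by its name:
X, y = orchest.get_inputs()["data"]
print(X.shape, y.shape)
```
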

Step 2: Data preprocessing

This step consists of the code for preprocessing the data obtained in the first step. The following image shows the second step.

This step contains the following code:

import orchest
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Retrieve the named outputs of the previous step.
data = orchest.get_inputs()
X, y = data["data"]

# Hold out 20% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Pass the splits on to the modelling steps.
orchest.output((X_train, y_train, X_test, y_test), name="training_data")

In this step, we used orchest.get_inputs() to import the data from the previous step, split it, and passed the split data on to the next step.
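Note that the scaler is fitted on the training split only and then merely applied to the test split; fitting it on the full dataset would leak test-set statistics into training. A minimal stand-alone check of that behaviour, using synthetic data so it runs without the Orchest runtime:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (100 samples, 3 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)  # learn min/max from training data
X_test_s = scaler.transform(X_test)        # reuse the same min/max on test data

# Training features are squashed exactly into [0, 1]; test features can
# fall slightly outside that range, which is expected and harmless.
print(X_train_s.min(), X_train_s.max())
```
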

Step 3: Defining model

In this step, we have combined three steps: we model the preprocessed data using three models (logistic regression, decision tree, and random forest). The image below represents this step.

The code pushed into these steps is nearly identical; only the model changes, so below I have posted the code for just one model.

import orchest
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Retrieve the data from the previous step.
data = orchest.get_inputs()
X_train, y_train, X_test, y_test = data["training_data"]

# Fit the model on the training split.
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate on the held-out test split.
y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)

# Pass the accuracy to the collection step under a descriptive name.
orchest.output(test_accuracy, name="logistic-regression-accuracy")

In the above code, we can see how the data is collected from the previous step and how the model's final result is pushed on to the next step.
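Outside the Orchest UI, the three parallel modelling branches can be sketched as a single plain scikit-learn script; the only difference between the branches is the estimator, so looping over the three models reproduces the comparison:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Same data gathering and preprocessing as the first two pipeline steps.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# One "step" per model, mirroring the three branches of the pipeline.
models = {
    "logistic-regression-accuracy": LogisticRegression(max_iter=1000),
    "decision-tree-accuracy": DecisionTreeClassifier(random_state=42),
    "random-forest-accuracy": RandomForestClassifier(random_state=42),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

for name, value in accuracies.items():
    print(f"{name:30} {value:.3f}")
```
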

Step 4: Collecting accuracy  

After defining and fitting the models, step 4 is our final step, which collects the accuracy of all the models. The image below shows our final pipeline.

The code in the final step is as follows:

import orchest

# Gather the named accuracy outputs from all three modelling steps.
data = orchest.get_inputs()
for name, value in data.items():
    if name != "unnamed":
        print(f"\n{name:30} {value}")
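The guard against the "unnamed" key exists because Orchest groups outputs emitted without a name under that key. Run against a mocked inputs dictionary (the accuracy values below are made up for illustration), the loop prints only the named accuracies:

```python
# Hypothetical stand-in for the dictionary orchest.get_inputs() returns;
# the accuracy values are invented for illustration.
data = {
    "logistic-regression-accuracy": 0.967,
    "decision-tree-accuracy": 0.933,
    "random-forest-accuracy": 0.967,
    "unnamed": [],  # bucket for outputs emitted without a name
}

collected = {}
for name, value in data.items():
    if name != "unnamed":
        collected[name] = value
        print(f"{name:30} {value}")
```
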


Here we can see the final output. We can also check it using the logs button in the pipeline interface.

Here we can also see whether any component of the pipeline has errors or needs bug fixes.

Final words 

In this article, we have discussed what Orchest is and found that it makes building machine learning pipelines easy. We also looked at an example that can be followed for building machine learning pipelines. After logging in, you can see this example pipeline using this link.


Yugesh Verma
Yugesh is a graduate in automobile engineering and worked as a data analyst intern. He completed several Data Science projects. He has a strong interest in Deep Learning and writing blogs on data science and machine learning.
