Now Reading
Everything About Pipelines In Machine Learning and How Are They Used?

Everything About Pipelines In Machine Learning and How Are They Used?

Rohit Dwivedi
W3Schools

In machine learning, while building a predictive model for classification and regression tasks there are a lot of steps that are performed from exploratory data analysis to different visualization and transformation. There are a lot of transformation steps that are performed to pre-process the data and get it ready for modelling like missing value treatment, encoding the categorical data, or scaling/normalizing the data. We do all these steps and build a machine learning model but while making predictions on the testing data we often repeat the same steps that were performed while preparing the data. 

So there are a lot of steps that are followed and while working on a big project in teams we can often get confused about this transformation. To resolve this we introduce pipelines that hold every step that is performed from starting to fit the data on the model. 

Through this article, we will explore pipelines in machine learning and will also see how to implement these for a better understanding of all the transformations steps.



What we will learn from this article? 

  • What are the pipelines in Machine learning?
  •  Advantages of building pipelines? 
  • How to implement a pipeline?
  1. What are the pipelines in Machine learning? 

Pipelines are nothing but an object that holds all the processes that will take place from data transformations to model building. Suppose while building a model we have done encoding for categorical data followed by scaling/ normalizing the data and then finally fitting the training data into the model. If we will design a pipeline for this task then this object will hold all these transforming steps and we just need to call the pipeline object and rest every step that is defined will be done. 

This is very useful when a team is working on the same project. Defining the pipeline will give the team members a clear understanding of different transformations taking place in the project. There is a class named Pipeline present in sklearn that allows us to do the same. All the steps in a pipeline are executed sequentially. On all the intermediate steps in the pipeline, there has to be a first fit function called and then transform whereas for the last step there will be only fit function that is usually fitting the data on the model for training. 

As soon as we fit the data on the pipeline, the pipeline object is first transformed and then fitted on each of the steps. While making predictions using the pipeline, all the steps are again repeated except for the last function of prediction. 

  1. How to implement a pipeline?

Implementation of the pipeline is very easy and involves 4 different steps mainly that are listed below:- 

  • First, we need to import pipeline from sklearn
  • Define the pipeline object containing all the steps of transformation that are to be performed. 
  • Now call the fit function on the pipeline.
  • Call the score function to check the score.

Let us now practically understand the pipeline and implement it on a data set. We will first import the required libraries and the data set. We will then split the data set into training and testing sets followed by defining the pipeline and then calling the fit score function. Refer to the below code for the same.

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
df = pd.read_csv('pima.csv')
X = df.values[:,0:7]
Y = df.values[:,8]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=7)
pipe  = Pipeline([('sc',StandardScaler()),('rfcl', RandomForestClassifier())])

We have defined the pipeline with the object name as pipe and this can be changed according to the programmer. We have defined sc objects for StandardScaler and rfcl for Random Forest Classifier. 

pipe.fit(X_train,y_train)

print(pipe.score(X_test, y_test)

If we do not want to define the objects for each step like sc and rfcl for StandardScaler and Random Forest Classifier since there can be sometimes many different transformations that would be done. For this, we can make use of make_pipeling that can be imported from the pipeline class present in sklearn. Refer to the below example for the same. 

from sklearn.pipeline import make_pipeline

pipe = make_pipeline(StandardScaler(),(RandomForestClassifier()))

We have just defined the functions in this case and not the objects for these functions. Now let’s see the steps present in this pipeline. 

See Also
Healthcare cyber-security

print(pipe.steps)

pipe.fit(X_train,y_train)

print(pipe.score(X_test, y_test))

Conclusion 

Through this article, we discussed pipeline construction in machine learning. How these can be helpful while different people working on the same project to avoid confusion and get a clear understanding of each step that is performed one after another. We then discussed steps for building a pipeline that had two steps i.e scaling and the model and implemented the same on the Pima Indians Diabetes data set. At last, we explored one other way of defining a pipeline that is building a pipeline using make a pipeline.

What Do You Think?

If you loved this story, do join our Telegram Community.


Also, you can write for us and be one of the 500+ experts who have contributed stories at AIM. Share your nominations here.

Copyright Analytics India Magazine Pvt Ltd

Scroll To Top