Hands-On Tutorial On Machine Learning Pipelines With Scikit-Learn

In this article, I’ll be discussing how to implement a machine learning pipeline using scikit-learn.

With the growing demand for machine learning and data science in business, better data strategy calls for a better workflow that keeps data modelling robust. A machine learning project follows a fixed set of steps: data collection, data preprocessing (cleaning and feature engineering), model training, validation, and prediction on test data (which the model has not seen before).

The test data has to go through the same preprocessing as the training data. Pipelines handle this repetitive process by automating the whole sequence for both training and test data. They make the model reusable by removing redundant steps, which speeds up the process. This can be very effective in a production workflow.




Advantages of using Pipeline:

  • Automates an iterative workflow
  • Makes bugs easier to fix
  • Production-ready
  • Encourages clean code
  • Helps with iterative hyperparameter tuning and cross-validated evaluation

Challenges in using Pipeline:

  • Proper data cleaning
  • Data Exploration and Analysis
  • Efficient feature engineering

Scikit-Learn Pipeline

The sklearn.pipeline module implements utilities to build a composite estimator, as a chain of transforms and estimators.
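
As a minimal sketch of what that chaining looks like, a pipeline can be built with explicit step names through Pipeline, or with auto-generated names through make_pipeline. The step names 'scale' and 'clf' are arbitrary labels picked for this illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Explicit (name, estimator) pairs; the names are illustrative choices.
named = Pipeline([("scale", StandardScaler()),
                  ("clf", LogisticRegression())])

# make_pipeline derives each step name from the lowercased class name.
auto = make_pipeline(StandardScaler(), LogisticRegression())

named_step_names = [name for name, _ in named.steps]
auto_step_names = [name for name, _ in auto.steps]
print(named_step_names)
print(auto_step_names)
```

The step names matter later on, because they are what you use to address a step's parameters during hyperparameter tuning.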

I’ve used the Iris dataset, which is readily available in scikit-learn’s datasets module. It has four feature columns, SepalLength, SepalWidth, PetalLength and PetalWidth (all in cm), plus the target, Species. There are 150 samples, 50 for each of the 3 classes: Iris setosa, Iris versicolor and Iris virginica.


After loading the data, split it into training and test sets, then build the pipeline object. Standardization is done with StandardScaler() and dimensionality reduction with PCA (principal component analysis); both are transformers, so they are fitted and then used to transform the data. The last step declares the model, here LogisticRegression, which is the estimator. The pipeline is then fitted and the model's performance score is computed.

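from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Load and split the data. These lines are missing from the original listing, so the
# split settings are assumptions and the score may differ from the one shown below.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Two transformers followed by the final estimator ('scalar1' is an assumed step name)
pipeline_lr = Pipeline([('scalar1', StandardScaler()),
                        ('pca1', PCA(n_components=2)),
                        ('lr_classifier', LogisticRegression(random_state=0))])
model = pipeline_lr.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out test set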

OUTPUT - 0.8666666666666667

With the pipeline, we preprocess the training data and fit the model in a single line of code. In contrast, without a pipeline, we have to do normalization, dimensionality reduction, and model training in separate steps. This becomes especially messy if we have to deal with both numerical and categorical variables.
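
To make the contrast concrete, here is a rough sketch that runs the same three steps by hand and then through a pipeline, and checks that both give the same predictions. The split settings and the step names ('scaler', 'pca', 'clf') are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
# Split settings are assumptions made for this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Manual version: fit every transformer on the training data only,
# then reuse the fitted transformers on the test data.
scaler = StandardScaler()
pca = PCA(n_components=2)
clf = LogisticRegression(random_state=0)
X_train_pca = pca.fit_transform(scaler.fit_transform(X_train))
clf.fit(X_train_pca, y_train)
manual_pred = clf.predict(pca.transform(scaler.transform(X_test)))

# Pipeline version: the same three steps wrapped in one object.
pipe = Pipeline([("scaler", StandardScaler()),
                 ("pca", PCA(n_components=2)),
                 ("clf", LogisticRegression(random_state=0))])
pipe_pred = pipe.fit(X_train, y_train).predict(X_test)

same_predictions = bool(np.array_equal(manual_pred, pipe_pred))
print(same_predictions)
```

In the manual version it is easy to call fit_transform on the test data by mistake; the pipeline removes that risk because predict only ever calls transform on the fitted steps.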

To fill in missing values and convert categorical values to numeric ones, add the following two steps inside the Pipeline object. (The Iris dataset has neither missing nor categorical values, so we don't use them here.)

('imputer', SimpleImputer(strategy='most_frequent'))  # fill missing values

('onehot', OneHotEncoder(handle_unknown='ignore'))    # encode categorical values

Make sure to import SimpleImputer from sklearn.impute and OneHotEncoder from sklearn.preprocessing!
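
Since Iris cannot show these two steps in action, here is a small sketch on made-up data. The DataFrame, its column names ('age', 'city') and the target values are all hypothetical, and routing columns with ColumnTransformer is one common way to combine numeric and categorical preprocessing, not something taken from the code above. It assumes pandas is installed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric and one categorical column, both with gaps.
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 35.0, 52.0, 29.0],
    "city": ["Pune", "Delhi", np.nan, "Pune", "Delhi", "Pune"],
})
target = [0, 1, 1, 0, 1, 0]

# Numeric columns: fill gaps with the median, then standardize.
numeric_steps = Pipeline([("imputer", SimpleImputer(strategy="median")),
                          ("scaler", StandardScaler())])
# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
categorical_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Send each column to the matching sub-pipeline.
preprocess = ColumnTransformer([("num", numeric_steps, ["age"]),
                                ("cat", categorical_steps, ["city"])])

clf = Pipeline([("preprocess", preprocess),
                ("lr", LogisticRegression())])
clf.fit(df, target)

# One scaled numeric column plus one column per city category.
n_features_out = clf.named_steps["preprocess"].transform(df).shape[1]
predictions = clf.predict(df)
print(n_features_out)
```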

Stacking Multiple Pipelines to Find the Model with the Best Accuracy

We build a separate pipeline for each algorithm and fit them all to see which performs best.

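from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm

# pipeline_lr comes from the previous section. The decision tree and KNN
# definitions are missing from the original listing; the two below are
# reconstructed by analogy with pipeline_svm (their step names are assumed).
pipeline_dt = Pipeline([('scalar2', StandardScaler()),
                        ('pca2', PCA(n_components=2)),
                        ('dt_classifier', DecisionTreeClassifier())])
pipeline_svm = Pipeline([('scalar3', StandardScaler()),
                         ('pca3', PCA(n_components=2)),
                         ('clf', svm.SVC())])
pipeline_knn = Pipeline([('scalar4', StandardScaler()),
                         ('pca4', PCA(n_components=2)),
                         ('knn_classifier', KNeighborsClassifier())])
# The order must match pipe_dict so each score is printed under the right name
pipelines = [pipeline_lr, pipeline_dt, pipeline_svm, pipeline_knn]
pipe_dict = {0: 'Logistic Regression', 1: 'Decision Tree', 2: 'Support Vector Machine', 3: 'K Nearest Neighbor'}
for pipe in pipelines:
    pipe.fit(X_train, y_train)
for i, model in enumerate(pipelines):
    print("{} Test Accuracy: {}".format(pipe_dict[i], model.score(X_test, y_test)))

OUTPUT -
Logistic Regression Test Accuracy: 0.8666666666666667
Decision Tree Test Accuracy: 0.9111111111111111
Support Vector Machine Test Accuracy: 0.9333333333333333
K Nearest Neighbor Test Accuracy: 0.9111111111111111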

From these results, the Support Vector Machine (SVM) pipeline has the highest test accuracy on this split, narrowly ahead of the decision tree and K-nearest-neighbor pipelines.
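
A single train/test split can be noisy on a dataset this small, so one way to double-check a ranking like this is k-fold cross-validation. The sketch below is only an illustration: it compares two of the classifiers with cross_val_score, and the choice of 5 folds is an assumption.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

candidates = {
    "Support Vector Machine": SVC(),
    "K Nearest Neighbor": KNeighborsClassifier(),
}

mean_scores = {}
for name, estimator in candidates.items():
    pipe = make_pipeline(StandardScaler(), PCA(n_components=2), estimator)
    # The scaler and PCA are re-fitted on the training part of every fold,
    # so no information from the validation part leaks into preprocessing.
    scores = cross_val_score(pipe, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print("{} mean CV accuracy: {:.3f}".format(name, mean_scores[name]))
```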

Hyperparameter Tuning in Pipeline

With pipelines, you can easily perform a grid search over a set of parameters for each step of this meta-estimator to find the best-performing parameters. To do this, you first create a parameter grid for your chosen model. One important thing to note is that you need to prefix each parameter name with the name of the pipeline step it belongs to, followed by a double underscore. In the code below, make_pipeline names the classifier step ‘randomforestclassifier’ automatically, so I have added randomforestclassifier__ to each parameter. Next, I create a grid search object that wraps the original pipeline. When I then call fit, a cross-validated grid search runs over the parameter grid, and any transformations in the pipeline are fitted on the training folds as part of each fit.

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier

# make_pipeline names the step after the lowercased class name: 'randomforestclassifier'
pipe = make_pipeline(RandomForestClassifier())
# Keys follow the pattern '<step name>__<parameter name>'
grid_param = [
    {"randomforestclassifier": [RandomForestClassifier()],
     "randomforestclassifier__n_estimators": [10, 100, 1000],
     "randomforestclassifier__max_depth": [5, 8, 15, 25, 30, None],
     "randomforestclassifier__min_samples_leaf": [1, 2, 5, 10, 15, 100],
     "randomforestclassifier__max_leaf_nodes": [2, 5, 10]}]
gridsearch = GridSearchCV(pipe, grid_param, cv=5, verbose=0, n_jobs=-1)
best_model = gridsearch.fit(X_train, y_train)
# Accuracy of the refitted best pipeline on the held-out test set
print(best_model.score(X_test, y_test))

OUTPUT - 0.9777777777777777
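
The same naming rule lets a grid search reach into any step, including the transformers. The following sketch is a small illustration under assumed settings: it tunes the number of PCA components together with the regularization strength C of a logistic regression, then reads the winning combination from best_params_. The grid values and the split are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
# Split settings are assumptions made for this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Auto-generated step names: 'standardscaler', 'pca', 'logisticregression'.
pipe = make_pipeline(StandardScaler(), PCA(),
                     LogisticRegression(max_iter=1000))

# One transformer parameter and one classifier parameter in the same grid.
param_grid = {
    "pca__n_components": [2, 3, 4],
    "logisticregression__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)            # winning parameter combination
print(search.best_score_)             # mean cross-validated accuracy
print(search.score(X_test, y_test))   # accuracy on the held-out test set
```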


This is a basic pipeline implementation. In a real-life data science scenario, the data would need to be prepared first, and the pipeline then applied for the rest of the process. Pipelines exist to build quick and efficient machine learning models. They are in high demand because they help you write better, more extensible code for big data projects, automating the applied machine learning workflow and saving time otherwise spent on redundant preprocessing work.
The complete code of the above implementation is available at AIM’s GitHub repository. Please visit this link to find the notebook with the code.

Jayita Bhattacharyya
Machine learning and data science enthusiast. Eager to learn new technology advances. A self-taught techie who loves to do cool stuff using technology for fun and worthwhile.
