Many supervised learning problems in machine learning are classification problems. Classification is the task of assigning a suitable class to a data point. Consider a pet classification problem. If we input certain features, the machine learning model will tell us whether the given features belong to a cat or a dog. Cat and dog are the two classes here. One may be numerically represented by 0 and the other by 1. This is specifically called a binary classification problem. If there are more than two classes, the problem is termed a multi-class classification problem. This machine learning task comes under supervised learning because both the features and the corresponding class are provided as input to the model during training. During testing or production, the model predicts the class given the features of a data point.
This article discusses Logistic Regression and the math behind it with a practical example and Python code. Logistic regression is one of the fundamental algorithms meant for classification. In its basic form, it handles binary classification problems; with some modifications, it can also perform multi-class classification.
Define a Binary Classification Problem
Create the environment by importing necessary libraries and modules.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn import metrics

sns.set_style('darkgrid')
Load a binary classification problem from SciKit-Learn’s in-built datasets. The breast cancer data is a binary classification problem with two classes. Load the data and metadata using the following code.
raw_data = load_breast_cancer()
raw_data.keys()
We can read more about the loaded data using the DESCR file.
A portion of the output:
The dataset contains 30 features and one target. Target has two classes: Malignant (cancerous state) and Benign (non-cancerous state).
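The description and the counts mentioned above can be checked directly on the loaded object. The following is a minimal sketch, assuming only SciKit-Learn's bundled `load_breast_cancer` loader; the 300-character slice is an arbitrary choice to keep the printout short.

```python
from sklearn.datasets import load_breast_cancer

# Load the bundled dataset (no download is needed)
raw_data = load_breast_cancer()

# Print the beginning of the dataset description
print(raw_data['DESCR'][:300])

# Number of examples and features, and the names of the two classes
n_samples, n_features = raw_data['data'].shape
class_names = list(raw_data['target_names'])
print(n_samples, n_features, class_names)
```

Note that class 0 corresponds to malignant and class 1 to benign in this dataset.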
Create a pandas dataframe for the features and a pandas series for the target.
data = pd.DataFrame(raw_data['data'], columns=raw_data['feature_names'])
target = pd.Series(raw_data['target'], name='target')
For more clarity, we proceed with only five selected features.
features = ['mean radius', 'mean texture', 'mean smoothness', 'mean compactness', 'mean concavity']
X = data[features]
y = target.copy()
X.head()
X has the selected features and y has the target. If we train the model on all of the available data, no unseen data remains to evaluate it. Hence, it is necessary to split the available data into training and validation sets. The training set is used to train the model and the validation set is used to evaluate the trained model.
X_train, X_val, y_train, y_val = train_test_split(X,y, test_size=0.2, random_state=6)
80% of the available data is randomly assigned to the training set and the remaining 20% to the validation set. Random state helps reproduce the results.
The Math behind Logistic Regression
Before proceeding with modeling and training, we should understand the concepts and math behind the Logistic Regression method. In general, when there is a smooth and continuous change in one or more features, a machine learning model gives a continuously varying output. But a binary classification problem needs discrete outputs, either 0 or 1, and a linear model cannot produce them directly. In this scenario, Logistic Regression computes a linear function called the Logit and passes it through the logistic (sigmoid) function, which pushes the output close to 0 or 1. In other words, Logistic Regression generates continuous outputs whose values lie between 0 and 1, and for a well-fitted model most of them are close to the bounding values.
The Logit is a linear function of the features and has the same form as the output of a Linear Regression model: the weighted sum of the features plus a bias. Bias and weights are also called the intercept and coefficients, respectively. For instance, our X data has five features, so the Logit can be defined as:

$$\text{Logit} = b + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + w_5 x_5$$

where b is the bias, and w_1 to w_5 are the weights of the five features x_1 to x_5.
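Once the Logit is calculated, it is passed through the logistic (sigmoid) function, which converts it to a probability P(y) between 0 and 1 and pushes most of the values towards either 0 or 1:

$$P(y) = \frac{1}{1 + e^{-\text{Logit}}}$$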
If the P(y) value is above 0.5, the class assigned to the data point is 1; otherwise, the class assigned is 0. Thus, the continuous probability P(y) is converted into a discrete class label (y) by thresholding.
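The steps described above (a weighted sum, a squashing function, and a threshold) can be sketched in plain Python. The standard logistic (sigmoid) function is assumed as the squashing function, and the weights, bias, and data points below are hypothetical values chosen only for illustration, not fitted ones.

```python
import math

def logit_value(features, weights, bias):
    """Linear part: weighted sum of the features plus the bias."""
    return bias + sum(w * x for w, x in zip(weights, features))

def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(probability, threshold=0.5):
    """Assign class 1 if the probability exceeds the threshold, else class 0."""
    return 1 if probability > threshold else 0

# Hypothetical weights and bias for five features (not fitted values)
weights = [0.8, -0.4, 1.5, 0.3, -1.2]
bias = -0.5

# Two hypothetical data points with five features each
points = [
    [2.0, 1.0, 3.0, 0.5, 0.2],
    [0.1, 3.0, 0.2, 0.1, 2.5],
]

for x in points:
    z = logit_value(x, weights, bias)
    p = sigmoid(z)
    print(f"logit={z:.2f}  probability={p:.3f}  class={classify(p)}")
```

A large positive Logit gives a probability near 1 and a large negative Logit gives a probability near 0, which is why most outputs end up close to the bounding values.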
Logistic Regression using statsmodels Library
Logistic Regression can be performed using either SciKit-Learn library or statsmodels library. However, the above math concepts can be explored clearly with statsmodels.
from statsmodels.api import Logit, add_constant

# add intercept manually
X_train_const = add_constant(X_train)
# build model and fit training data
model_1 = Logit(y_train, X_train_const).fit()
# print the model summary
model_1.summary()
Output (useful information is highlighted):
The bias and coefficients of the Logit function are estimated by Logistic Regression using Maximum Likelihood Estimation (MLE). The coefficients in the above output are the bias and the five weights, respectively.
The probability distribution of the logit function for training data can be obtained and visualized using the following code.
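To give a feel for what Maximum Likelihood Estimation does, here is a minimal sketch on toy data. It swaps in plain gradient ascent for whatever optimizer the library uses internally, so it illustrates the idea rather than the library's implementation. The data, learning rate, and iteration count are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(params, X, y):
    """Bernoulli log-likelihood of the labels y given X and the parameters."""
    p = sigmoid(X @ params)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy data: one feature, with overlapping classes so a finite optimum exists
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = np.array([0, 0, 1, 0, 1, 0, 1, 1])

# Prepend a column of ones so that params[0] acts as the bias
X = np.column_stack([np.ones_like(x), x])

params = np.zeros(2)                    # start from bias = 0, weight = 0
start_ll = log_likelihood(params, X, y)

learning_rate = 0.05                    # assumed step size for this toy problem
for _ in range(5000):
    # Gradient of the log-likelihood with respect to the parameters
    gradient = X.T @ (y - sigmoid(X @ params))
    params += learning_rate * gradient

final_ll = log_likelihood(params, X, y)
print("bias, weight:", params)
print("log-likelihood before and after:", start_ll, final_ll)
```

The parameters that maximize the log-likelihood are the ones under which the observed labels are most probable; the fitted bias and weights reported by the library play the same role for the five-feature problem.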
# Probability Distribution for Training data
prob_train = model_1.predict(X_train_const)
# sort the prob dist for visualization
sorted_train = sorted(prob_train.values)
index_train = np.arange(len(sorted_train))
# plot it
plt.plot(index_train, sorted_train, '+r')
plt.title('Training Data: Probability Distribution', size=14, color='orange')
plt.xlabel('Examples (sorted by output value)')
plt.ylabel('Probability of Logit function')
plt.show()
It can be observed that the probability values are pushed close to either 0 or 1. Most of the points are close to 0 or 1, while a few points make the shift from 0 to 1. Moreover, the shift from 0 to 1 is sudden. It helps the model make decisions with more confidence. By default, 0.5 is the decision boundary (or technically called the threshold). Even if this threshold is shifted a little above or below, hardly any point will be differently classified. Let’s predict the probability distribution for the validation data and plot it.
# Probability Distribution for Validation data
X_val_const = add_constant(X_val)
prob_val = model_1.predict(X_val_const)
# sort the prob dist for visualization
sorted_val = sorted(prob_val.values)
index_val = np.arange(len(sorted_val))
plt.plot(index_val, sorted_val, '+g')
plt.title('Validation Data: Probability Distribution', size=14, color='orange')
plt.xlabel('Examples (sorted by output value)')
plt.ylabel('Probability of Logit function')
plt.show()
It is because of this continuous transition of predicted values from 0 to 1 that the method is called Logistic Regression rather than Logistic Classification.
Let’s perform classification using the probability distribution. Define 0.5 as threshold and classify data points either as 0 or 1.
threshold = 0.5
y_pred = (prob_val > threshold).astype(np.int8)
Evaluate the model using Accuracy score.
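Accuracy is the fraction of data points whose predicted class matches the actual class. The following is a small self-contained sketch using `metrics.accuracy_score` on hypothetical labels and probabilities (they are not taken from the breast cancer data); in the article's setting, the same call would presumably receive `y_val` and `y_pred`.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true labels and predicted probabilities for eight data points
y_true = np.array([0, 1, 1, 0, 1, 1, 0, 1])
probs = np.array([0.10, 0.92, 0.65, 0.40, 0.30, 0.88, 0.55, 0.97])

# Apply the 0.5 threshold to obtain discrete class predictions
y_hat = (probs > 0.5).astype(np.int8)

# Fraction of predictions that match the true labels
accuracy = metrics.accuracy_score(y_true, y_hat)
print(accuracy)
```

Here two of the eight points fall on the wrong side of the threshold, so six of eight predictions are correct.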
A classification report is a handy method that yields evaluation based on multiple metrics in a class-wise manner. To get familiar with various classification metrics, one can refer to this article.
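As a minimal sketch, `metrics.classification_report` can be called as below. The labels are hypothetical, and the `target_names` argument is an optional addition made here for readability, with 0 standing for malignant and 1 for benign as in the breast cancer dataset.

```python
from sklearn import metrics

# Hypothetical true and predicted labels (0 = malignant, 1 = benign)
y_true = [0, 1, 1, 0, 1, 1, 0, 1]
y_hat = [0, 1, 1, 0, 0, 1, 1, 1]

# Text report with precision, recall, F1-score and support for each class
report = metrics.classification_report(
    y_true, y_hat, target_names=['malignant', 'benign']
)
print(report)
```

In the article's setting, the validation labels and the thresholded predictions would take the place of the two hypothetical lists.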
The confusion matrix may give a better insight on performance.
conf = pd.DataFrame(metrics.confusion_matrix(y_val, y_pred),
                    index=['Actual Malignant', 'Actual Benign'],
                    columns=['Predicted Malignant', 'Predicted Benign'])
conf
In total, 9 of the 114 validation data points are misclassified.
We can try different threshold values manually to check model performance.
accuracies = []
thresholds = np.arange(0.0, 1.01, 0.05)
for th in thresholds:
    y_preds = (prob_val > th).astype(np.int8)
    acc = metrics.accuracy_score(y_val, y_preds)
    accuracies.append(acc)
# plot the accuracy values
plt.plot(thresholds, accuracies, '*m')
plt.xlabel('Threshold')
plt.ylabel('Accuracy')
plt.show()
The visualization shows that changing the threshold value does not greatly impact the accuracy. Any threshold value between 0.2 and 0.8 produces an accuracy above 90%. Moreover, the plot shows that the maximum accuracy is obtained for a threshold value of around 0.7.
Using SciKit-Learn Library
Logistic Regression is performed with a few lines of code using the SciKit-Learn library.
from sklearn.linear_model import LogisticRegression

# penalty=None disables regularization (scikit-learn 1.2+; older versions use penalty='none')
model_2 = LogisticRegression(penalty=None)
model_2.fit(X_train, y_train)
Evaluate the model with validation data. Infer predictions with X_val and calculate the accuracy.
y_pred_2 = model_2.predict(X_val)
metrics.accuracy_score(y_val, y_pred_2)
Explore the classification report to get class-wise insight on Precision, Recall and F1-score.
Probability distribution can also be obtained using the following code.
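In SciKit-Learn, class-wise probabilities are available through the `predict_proba` method of a fitted classifier. The sketch below uses a tiny hypothetical one-feature dataset and the library's default settings so that it runs on its own; in the article's setting, the call would presumably be made on `model_2` with `X_val`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature toy data with two classes
X_toy = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [4.5]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Default settings are used here to keep the sketch version-independent
toy_model = LogisticRegression()
toy_model.fit(X_toy, y_toy)

# One row per data point: [probability of class 0, probability of class 1]
proba = toy_model.predict_proba(X_toy)
print(proba[:3])

# The predicted class is the column with the larger probability
predicted = toy_model.predict(X_toy)
```

Each row sums to 1, and the second column corresponds to the single probability value that statsmodels returns for class 1.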
A portion of the output:
This library yields class-wise probability distribution (the bigger one among two is the predicted class).
We have performed Logistic regression with two libraries. Let’s compare the predictions of both libraries.
# y_pred is the prediction of statsmodels library
# y_pred_2 is the prediction of sklearn library
# Compare both libraries
(y_pred == y_pred_2).all()
Both libraries yield identical predictions here because, with regularization disabled in SciKit-Learn, both fit the same maximum-likelihood model.
This notebook contains the above code implementation.
This article discussed Logistic Regression, the mathematical concepts involved in it, and its implementation with a famous binary classification problem. We have further explored how different threshold values can affect classification performance and discussed why the method has the name Logistic Regression rather than Logistic Classification.
Interested readers can explore the various options available in these methods by referring to the official documentation and try solving their own classification problems. Further, readers can attempt to solve multi-class classification problems using the Logistic Regression method.