Ultimate Guide To Loss functions In PyTorch With Python Implementation

Have you ever wondered how we humans evolved so much? We learn from our mistakes and continuously try to improve on the basis of them. The same is true of machines: they too can learn from their mistakes. But how? In neural networks and AI, we give algorithms the freedom to find the best prediction, but an algorithm cannot improve without measuring how far its predictions are from the truth. This is where the loss function comes into the picture.

A loss function measures the mistakes a machine makes: the further the prediction of a machine learning algorithm is from the ground truth, the larger the loss. The machine improves its outputs by decreasing that loss. Earlier, we had to write loss functions by hand to suit each problem, but libraries like PyTorch now let users call a loss function with a single line of code.

Today we will discuss the major PyTorch loss functions that are used extensively across machine learning tasks, with implementations in Python code inside a Jupyter notebook. Different problems, such as regression or classification, call for different kinds of loss functions, and PyTorch provides almost 19 of them.

Loss function

A loss function, or cost function, is a function that maps an event or values of one or more variables onto a real number intuitively representing some “cost” associated with the event. An optimization problem seeks to minimize a loss function. An objective function is either a loss function or its negative (in specific domains, variously called a reward function, a profit function, a utility function, a fitness function, etc.), in which case it is to be maximized.
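To make "an optimization problem seeks to minimize a loss function" concrete, here is a minimal sketch. The toy data (y = 2x), the single weight w, the learning rate and the number of steps are all assumptions chosen for illustration.

```python
import torch

# Toy data generated from y = 2x; the "model" is a single weight w.
x = torch.tensor([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x

w = torch.tensor(0.0, requires_grad=True)  # start from a deliberately bad guess
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD([w], lr=0.01)

initial_loss = loss_fn(w * x, y).item()
for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(w * x, y)  # one real number summarizing the error
    loss.backward()           # gradient of the loss with respect to w
    optimizer.step()          # move w in the direction that lowers the loss
final_loss = loss_fn(w * x, y).item()

print('initial loss:', initial_loss)
print('final loss:', final_loss)
print('learned w:', w.item())
```

The loss shrinks as w approaches 2, the value that generated the data.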

Getting started

You can try the tutorial below in Google Colab, which comes with the major data science packages preinstalled, including PyTorch.

import torch
loss = torch.nn.L1Loss()

To run PyTorch locally on your machine, you can download it from here according to your build: https://pytorch.org/get-started/locally/

Torch is a tensor library like NumPy, with strong GPU support. torch.nn is a package inside the PyTorch library that helps us create and train neural networks. Read more about torch.nn in the PyTorch documentation.

1. Mean Absolute Error (nn.L1Loss)

It is the simplest form of error metric. Mean Absolute Error (MAE) measures the numerical distance between predicted and true values by taking the absolute value of each difference, summing them, and dividing by the total number of data points. MAE is a linear score: every error contributes in proportion to its size. Let’s see how to calculate it without using the PyTorch module.

Algorithmic way of finding the loss function without the PyTorch module

import numpy as np
y_pred = np.array([0.000, 0.100, 0.200])
y_true = np.array([0.000, 0.200, 0.250])
# Defining Mean Absolute Error loss function
def mae(pred, true):
    # Find absolute difference
    differences = pred - true
    absolute_differences = np.absolute(differences)
    # find the absolute mean
    mean_absolute_error = absolute_differences.mean()
    return mean_absolute_error
mae_value = mae(y_pred, y_true)
print ("MAE error is: " + str(mae_value))

With the PyTorch module (nn.L1Loss)

import torch
mae_loss = torch.nn.L1Loss()
input = torch.tensor(y_pred)
target = torch.tensor(y_true)
output = mae_loss(input, target)
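As a quick sanity check, the sketch below repeats the same numbers in one self-contained snippet and confirms that the NumPy and PyTorch versions agree. The extra reduction='sum' call is included only to illustrate that nn.L1Loss averages by default.

```python
import numpy as np
import torch

y_pred = np.array([0.000, 0.100, 0.200])
y_true = np.array([0.000, 0.200, 0.250])

# Absolute errors are 0, 0.1 and 0.05, so their mean is about 0.05
numpy_mae = np.abs(y_pred - y_true).mean()

pred_t = torch.tensor(y_pred)
true_t = torch.tensor(y_true)
torch_mae = torch.nn.L1Loss()(pred_t, true_t)                 # default reduction='mean'
torch_sum = torch.nn.L1Loss(reduction='sum')(pred_t, true_t)  # sum of absolute errors

print('NumPy MAE: ', numpy_mae)
print('PyTorch MAE: ', torch_mae.item())
print('PyTorch summed absolute error: ', torch_sum.item())
```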

2. Mean Squared Error (nn.MSELoss)

Like Mean Absolute Error (MAE), Mean Squared Error (MSE) works on the paired differences between ground truth and prediction, but it sums their squares and divides by the number of such pairs.

MSE is generally used when larger errors should be penalized more heavily. It has some drawbacks, though: it also squares the units of the data, so its value is not directly comparable with quantities in the original units.

Mean-Squared Error using PyTorch

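import torch.nn as nn
# input and target must have the same shape
input = torch.randn(3, 4, requires_grad=True)
target = torch.randn(3, 4)
mse_loss = nn.MSELoss()
output = mse_loss(input, target)
print('input -: ', input)
print('target -: ', target)
print('output -: ', output)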
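To see how squaring makes large errors dominate, here is a small sketch with made-up numbers (the last target is a deliberate outlier). It also computes MSE by hand to show what nn.MSELoss does under the default mean reduction.

```python
import torch
import torch.nn as nn

prediction = torch.tensor([1.0, 2.0, 3.0, 4.0])
target = torch.tensor([1.5, 2.5, 3.5, 14.0])  # last point is an outlier (error of 10)

# Manual MSE: square each difference, then average
manual_mse = ((prediction - target) ** 2).mean()
mse = nn.MSELoss()(prediction, target)
mae = nn.L1Loss()(prediction, target)

print('manual MSE: ', manual_mse.item())
print('nn.MSELoss: ', mse.item())
print('nn.L1Loss: ', mae.item())
```

The single outlier contributes 100 of the 100.75 total squared error, while it contributes 10 of the 11.5 total absolute error.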

3. Binary Cross Entropy (nn.BCELoss)

This loss creates a criterion that measures the binary cross-entropy (BCE) between the target and the output. The binary cross-entropy loss function is used together with the Sigmoid activation function, which works as a squashing function and limits the output to a range between 0 and 1.

Using the Binary Cross Entropy loss function without the module

y_pred = np.array([0.1580, 0.4137, 0.2285])
y_true = np.array([0.0, 1.0, 0.0]) #2 labels: (0,1)
def BCE(y_pred, y_true):
    total_bce_loss = np.sum(-y_true * np.log(y_pred) - (1 - y_true) * np.log(1 - y_pred))
    # Getting the mean BCE loss
    num_of_samples = y_pred.shape[0]
    mean_bce_loss = total_bce_loss / num_of_samples
    return mean_bce_loss
bce_value = BCE(y_pred, y_true)
print ("BCE error is: " + str(bce_value))

Binary Cross Entropy(BCELoss) using PyTorch

bce_loss = torch.nn.BCELoss()
# Sigmoid maps raw scores into (0, 1); y_pred already holds probabilities, so it is not applied here
sigmoid = torch.nn.Sigmoid()
input = torch.tensor(y_pred)
target = torch.tensor(y_true)
output = bce_loss(input, target)

4. BCEWithLogitsLoss (nn.BCEWithLogitsLoss)

It combines a Sigmoid layer and the BCELoss in one single class. By merging the two operations, it can use the log-sum-exp trick, which makes it more numerically stable than a plain Sigmoid followed by a BCELoss.

target = torch.ones([10, 64], dtype=torch.float32)  # 64 classes, batch size = 10
output = torch.full([10, 64], 1.5)  # A prediction (logit)
pos_weight = torch.ones([64])  # All weights are equal to 1
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)
criterion(output, target)  # -log(sigmoid(1.5))
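The following sketch illustrates both claims with made-up logits: for ordinary values the fused loss matches Sigmoid followed by BCELoss, and for an extreme logit (-200 is an assumption picked to force underflow in float32) only the fused version returns the true loss.

```python
import torch

logits = torch.tensor([1.5, -0.3, 2.0])
target = torch.tensor([1.0, 0.0, 1.0])

# For moderate logits the two routes give the same result
with_logits = torch.nn.BCEWithLogitsLoss()(logits, target)
two_step = torch.nn.BCELoss()(torch.sigmoid(logits), target)
print('fused: ', with_logits.item())
print('sigmoid + BCELoss: ', two_step.item())

# Extreme logit: the true loss is -log(sigmoid(-200)), which is about 200
big_logit = torch.tensor([-200.0])
big_target = torch.tensor([1.0])
stable = torch.nn.BCEWithLogitsLoss()(big_logit, big_target)
# sigmoid(-200) underflows to 0 in float32, so the two-step route loses the true value
naive = torch.nn.BCELoss()(torch.sigmoid(big_logit), big_target)
print('fused on extreme logit: ', stable.item())
print('sigmoid + BCELoss on extreme logit: ', naive.item())
```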

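5. Negative Log-Likelihood Loss (nn.NLLLoss)

The negative log-likelihood loss is mostly used in classification problems. Here, likelihood refers to how probable the known data is under the model's parameters. nn.NLLLoss expects log-probabilities as input, which is why a LogSoftmax layer is applied to the raw scores first.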

input = torch.randn(3, 5, requires_grad=True)
# every element in target should have value(0 <= value < C)
target = torch.tensor([1, 0, 4])
m = nn.LogSoftmax(dim=1)
nll_loss = nn.NLLLoss()
output = nll_loss(m(input), target)
print('input -: ', input)
print('target -: ', target)
print('output -: ', output)
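To show what nn.NLLLoss computes under the default mean reduction, here is a minimal sketch. It picks the log-probability of the correct class for each sample by hand, negates it and averages; the seed and tensor names are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(3, 5)          # 3 samples, 5 classes
target = torch.tensor([1, 0, 4])    # correct class index for each sample

log_probs = nn.LogSoftmax(dim=1)(logits)

# Pick the log-probability assigned to the correct class of each sample
picked = log_probs[torch.arange(3), target]
manual_nll = -picked.mean()

builtin_nll = nn.NLLLoss()(log_probs, target)
print('manual: ', manual_nll.item())
print('nn.NLLLoss: ', builtin_nll.item())
```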

6. PoissonNLLLoss (nn.PoissonNLLLoss)

This loss represents the negative log-likelihood loss with a Poisson distribution of the target. With the default log_input=True, the loss for each element is exp(input) - target * input.

import torch.nn as nn
loss = nn.PoissonNLLLoss()
log_input = torch.randn(5, 2, requires_grad=True)
target = torch.randn(5, 2)
output = loss(log_input, target)
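As a hedged check of the default behavior (log_input=True, full=False), the sketch below compares nn.PoissonNLLLoss with exp(input) - target * input averaged over all elements. The count targets are sampled with torch.poisson from an assumed rate of 3, since Poisson targets are normally non-negative counts.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
log_rate = torch.randn(5, 2)                      # model output, read as the log of the Poisson rate
counts = torch.poisson(torch.full((5, 2), 3.0))   # non-negative count targets (assumed rate of 3)

builtin = nn.PoissonNLLLoss()(log_rate, counts)
manual = (torch.exp(log_rate) - counts * log_rate).mean()

print('nn.PoissonNLLLoss: ', builtin.item())
print('manual: ', manual.item())
```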

7. Cross-Entropy Loss (nn.CrossEntropyLoss)

Cross-Entropy loss, or Categorical Cross-Entropy (CCE), combines the Log Softmax and Negative Log-Likelihood loss functions in a single class. It is used for tasks with more than two classes, such as classifying a vehicle as a car, motorcycle, truck, etc.

The formula is a generalization of binary cross-entropy with an additional summation over all classes j: loss = -sum_j y_j * log(p_j), where y_j is 1 for the true class and 0 otherwise, and p_j is the predicted probability of class j.

input = torch.randn(3, 5, requires_grad=True)
target = torch.empty(3, dtype=torch.long).random_(5)
cross_entropy_loss = nn.CrossEntropyLoss()
output = cross_entropy_loss(input, target)
print('input: ', input)
print('target: ', target)
print('output: ', output)
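The relationship between the three classes can be checked directly. In this small sketch (seed and names are arbitrary), nn.CrossEntropyLoss on raw scores gives the same value as nn.NLLLoss applied to the LogSoftmax of those scores.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(3, 5)          # raw, unnormalized scores
target = torch.tensor([2, 0, 4])    # class indices

# One step: CrossEntropyLoss works directly on raw scores
ce = nn.CrossEntropyLoss()(logits, target)

# Two steps: LogSoftmax followed by NLLLoss
two_step = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), target)

print('CrossEntropyLoss: ', ce.item())
print('LogSoftmax + NLLLoss: ', two_step.item())
```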

8. Hinge Embedding Loss (nn.HingeEmbeddingLoss)

Hinge Embedding loss computes the loss for an input tensor x and a label tensor y whose values are either 1 or -1. It is typically used to measure whether two inputs are similar or dissimilar, with x holding a distance between them, which makes it a good loss function for binary classification problems of that kind.

input = torch.randn(3, 5, requires_grad=True)
target = torch.randn(3, 5).sign()  # labels must be 1 or -1
hinge_loss = nn.HingeEmbeddingLoss()
output = hinge_loss(input, target)
print('input -: ', input)
print('target -: ', target)
print('output -: ', output)
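For intuition, here is a sketch of the per-element rule as documented for nn.HingeEmbeddingLoss: the loss is x when y is 1, and max(0, margin - x) when y is -1. The distances and labels are random values made up for the comparison.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.rand(3, 5) * 2                          # non-negative "distances"
y = torch.randint(0, 2, (3, 5)).float() * 2 - 1   # labels drawn from {-1, 1}
margin = 1.0

# y = 1: penalize the distance itself; y = -1: penalize distances below the margin
manual = torch.where(y == 1, x, torch.clamp(margin - x, min=0)).mean()
builtin = nn.HingeEmbeddingLoss(margin=margin)(x, y)

print('manual: ', manual.item())
print('nn.HingeEmbeddingLoss: ', builtin.item())
```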

9. Margin Ranking Loss (nn.MarginRankingLoss)

Margin Ranking Loss computes a criterion for predicting the relative ranking of inputs. This makes it very different from loss functions like MSE or Cross-Entropy, which compare a prediction with a target value.

This function calculates the loss given inputs x1 and x2, as well as a label tensor y containing 1 or -1. When y is 1, the first input is assumed to be the larger value and will be ranked higher than the second input. Similarly, if y is -1, the second input will be ranked higher. It is mostly used in ranking problems.

first_input = torch.randn(3, requires_grad=True)
second_input = torch.randn(3, requires_grad=True)
target = torch.randn(3).sign()

ranking_loss = nn.MarginRankingLoss()
output = ranking_loss(first_input, second_input, target)
print('input one: ', first_input)
print('input two: ', second_input)
print('target: ', target)
print('output: ', output)
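A small worked example may help here. The sketch below uses hand-picked scores (an assumption for illustration) and the documented per-pair rule max(0, -y * (x1 - x2) + margin), with the default margin of 0.

```python
import torch
import torch.nn as nn

x1 = torch.tensor([0.8, 0.2, 0.5])
x2 = torch.tensor([0.3, 0.6, 0.5])
y = torch.tensor([1.0, 1.0, -1.0])   # 1: x1 should rank higher, -1: x2 should rank higher

# Pair 1 is ranked correctly (loss 0), pair 2 is ranked wrongly (loss 0.4),
# pair 3 is a tie, which costs nothing when the margin is 0
manual = torch.clamp(-y * (x1 - x2), min=0).mean()
builtin = nn.MarginRankingLoss()(x1, x2, y)

print('manual: ', manual.item())
print('nn.MarginRankingLoss: ', builtin.item())
```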

10. Smooth L1 Loss (nn.SmoothL1Loss)

It is closely related to the Huber loss: it uses a squared term if the absolute error falls below 1 and an absolute term otherwise. Smooth L1 loss is less sensitive to outliers than loss functions like mean squared error, and in some cases it can also prevent exploding gradients.

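# `dataset`, `model` and `i` stand for your own data, network and sample index
sample, target = dataset[i]
target_predicted = model(sample)
loss = torch.nn.SmoothL1Loss()
loss_value = loss(target_predicted, target)  # prediction first, then target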
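Here is a runnable sketch of the piecewise rule with the default threshold of 1: 0.5 * error^2 for absolute errors below 1, and error - 0.5 otherwise. The numbers are made up, with one deliberate outlier, to compare the result against MSE.

```python
import torch
import torch.nn as nn

prediction = torch.tensor([1.0, 2.0, 3.0, 4.0])
target = torch.tensor([1.5, 2.5, 3.5, 14.0])  # last point is an outlier (error of 10)

# Squared term for small errors, linear term for large ones
diff = (prediction - target).abs()
manual = torch.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5).mean()

smooth_l1 = nn.SmoothL1Loss()(prediction, target)
mse = nn.MSELoss()(prediction, target)

print('manual Smooth L1: ', manual.item())
print('nn.SmoothL1Loss: ', smooth_l1.item())
print('nn.MSELoss: ', mse.item())
```

The outlier contributes 9.5 to the Smooth L1 total but 100 to the squared-error total, which is why Smooth L1 reacts far less strongly to it.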

11. Triplet Margin Loss Function (nn.TripletMarginLoss)

The Triplet Margin Loss function measures the relative similarity between samples, and it is used in content-based retrieval problems.

This function calculates the loss given input tensors x1, x2, x3 and a margin with a value greater than zero. A triplet consists of an anchor a, a positive example p, and a negative example n.

anchor = torch.randn(100, 128, requires_grad=True)
positive = torch.randn(100, 128, requires_grad=True)
negative = torch.randn(100, 128, requires_grad=True)

triplet_margin_loss = nn.TripletMarginLoss(margin=1.0, p=2)
output = triplet_margin_loss(anchor, positive, negative)

print('anchors -: ', anchor)
print('positive -: ', positive)
print('negative -: ', negative)
print('output -: ', output)
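To make the rule explicit, the sketch below recomputes the loss as max(d(a, p) - d(a, n) + margin, 0), averaged over the batch. It assumes F.pairwise_distance with its default settings matches the distance used inside nn.TripletMarginLoss; the batch size and seed are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
anchor = torch.randn(10, 128)
positive = torch.randn(10, 128)
negative = torch.randn(10, 128)
margin = 1.0

# Distance from the anchor to the positive and to the negative example
d_ap = F.pairwise_distance(anchor, positive, p=2)
d_an = F.pairwise_distance(anchor, negative, p=2)

# The loss is zero once the negative is at least `margin` further away than the positive
manual = torch.clamp(d_ap - d_an + margin, min=0).mean()
builtin = nn.TripletMarginLoss(margin=margin, p=2)(anchor, positive, negative)

print('manual: ', manual.item())
print('nn.TripletMarginLoss: ', builtin.item())
```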

12. Kullback-Leibler divergence (nn.KLDivLoss)

The Kullback-Leibler divergence, also known as the KL divergence loss function, computes the amount of information lost when the predicted outputs are used to approximate the expected target distribution.

It measures how close two probability distributions are. If the value of the loss function is zero, the probability distributions are the same.

Kullback-Leibler divergence is closely related to the Cross-Entropy loss function: cross-entropy equals the KL divergence plus the entropy of the target distribution. For fixed targets that entropy is constant, so minimizing one minimizes the other.

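# KLDivLoss expects log-probabilities as input and probabilities as target
input = torch.log_softmax(torch.randn(2, 3, requires_grad=True), dim=1)
target = torch.softmax(torch.randn(2, 3), dim=1)

kld_loss = nn.KLDivLoss(reduction = 'batchmean')
output = kld_loss(input, target)

print('input tensor: ', input)
print('target tensor: ', target)
print('Loss: ', output)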
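As a minimal sketch of what reduction='batchmean' computes, the snippet below sums target * (log(target) - input) over all elements and divides by the batch size. It also checks that the loss is zero when the two distributions are identical. The names p and log_q are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
log_q = torch.log_softmax(torch.randn(2, 3), dim=1)  # predicted distribution, as log-probabilities
p = torch.softmax(torch.randn(2, 3), dim=1)          # target distribution, as probabilities

kld = nn.KLDivLoss(reduction='batchmean')
builtin = kld(log_q, p)

# Sum of p * (log p - log q) over all elements, divided by the batch size
manual = (p * (p.log() - log_q)).sum() / p.shape[0]

# Identical distributions give a loss of (numerically) zero
same = kld(p.log(), p)

print('nn.KLDivLoss: ', builtin.item())
print('manual: ', manual.item())
print('identical distributions: ', same.item())
```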

Wrapping Up

That’s it! We covered the major PyTorch loss functions, their mathematical definitions, algorithmic implementations, and hands-on use of PyTorch’s API in Python.

The working notebook for this guide is available here. You can find the full source code behind all these PyTorch loss function classes here. For the loss functions we didn’t cover in this tutorial, you can learn more about their usage from the references below:

Mohit Maithani
Mohit is a Data & Technology Enthusiast with good exposure to solving real-world problems in various avenues of IT and Deep learning domain. He believes in solving human's daily problems with the help of technology.
